Repository: blaze/castra
Branch: master
Commit: 1ae53dfcafdd
Files: 10
Total size: 44.4 KB

Directory structure:
gitextract_9pxf5njk/

├── .gitignore
├── .travis.yml
├── LICENSE
├── MANIFEST.in
├── README.rst
├── castra/
│   ├── __init__.py
│   ├── core.py
│   └── tests/
│       └── test_core.py
├── requirements.txt
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.pyc


================================================
FILE: .travis.yml
================================================
sudo: False

language: python

matrix:
  include:
    - python: 2.7
    - python: 3.3
    - python: 3.4
    - python: 3.5

install:
  # Install conda
  - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - conda config --set always_yes yes --set changeps1 no
  - conda update conda

  # Install dependencies
  - conda create -n castra python=$TRAVIS_PYTHON_VERSION pytest numpy pandas dask
  - source activate castra
  - pip install blosc
  - pip install bloscpack
  - pip install dask --upgrade

script:
  - py.test -x --doctest-modules --pyargs castra

notifications:
  email: false


================================================
FILE: LICENSE
================================================
﻿Copyright (c) 2015, Continuum Analytics, Inc.
Copyright (c) 2015, Valentin Haenel <valentin@haenel.co>
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

Neither the name of Continuum Analytics nor the names of any contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.


================================================
FILE: MANIFEST.in
================================================
recursive-include castra *.py
recursive-include docs *.rst

include setup.py
include README.rst
include LICENSE
include requirements.txt
include MANIFEST.in

prune docs/_build


================================================
FILE: README.rst
================================================
Castra
======

|Build Status|

Castra is an on-disk, partitioned, compressed, column store.
Castra provides efficient columnar range queries.

*  **Efficient on-disk:**  Castra stores data on your hard drive in a way that you can load it quickly, increasing the comfort of inconveniently large data.
*  **Partitioned:**  Castra partitions your data along an index, allowing rapid loads of ranges of data like "All records between January and March".
*  **Compressed:**  Castra uses Blosc_ to compress data, increasing effective disk bandwidth and decreasing storage costs.
*  **Column-store:**  Castra stores columns separately, drastically reducing I/O costs for analytic queries.
*  **Tabular data:**  Castra plays well with Pandas and is an ideal fit for append-only applications like time-series.

Maintenance
-----------

This project is no longer actively maintained.  Use at your own risk.

Example
-------

Consider some Pandas DataFrames

.. code-block:: python

   In [1]: import pandas as pd
   In [2]: A = pd.DataFrame({'price': [10.0, 11.0], 'volume': [100, 200]},
      ...:                  index=pd.DatetimeIndex(['2010', '2011']))

   In [3]: B = pd.DataFrame({'price': [12.0, 13.0], 'volume': [300, 400]},
      ...:                  index=pd.DatetimeIndex(['2012', '2013']))

We create a Castra with a filename and a template dataframe from which to get
column names, index, and dtype information:

.. code-block:: python

   In [4]: from castra import Castra
   In [5]: c = Castra('data.castra', template=A)

The castra starts empty but we can extend it with new dataframes:

.. code-block:: python

   In [6]: c.extend(A)

   In [7]: c[:]
   Out[7]:
               price  volume
   2010-01-01     10     100
   2011-01-01     11     200

   In [8]: c.extend(B)

   In [9]: c[:]
   Out[9]:
               price  volume
   2010-01-01     10     100
   2011-01-01     11     200
   2012-01-01     12     300
   2013-01-01     13     400

We can select particular columns

.. code-block:: python

   In [10]: c[:, 'price']
   Out[10]:
   2010-01-01    10
   2011-01-01    11
   2012-01-01    12
   2013-01-01    13
   Name: price, dtype: float64

Particular ranges

.. code-block:: python

   In [12]: c['2011':'2013']
   Out[12]:
               price  volume
   2011-01-01     11     200
   2012-01-01     12     300
   2013-01-01     13     400

Or both

.. code-block:: python

   In [13]: c['2011':'2013', 'volume']
   Out[13]:
   2011-01-01    200
   2012-01-01    300
   2013-01-01    400
   Name: volume, dtype: int64

Storage
-------

Castra stores your dataframes as they arrived.  You can see the divisions along
which your data is partitioned:

.. code-block:: python

   In [14]: c.partitions
   Out[14]:
   2011-01-01    2009-12-31T16:00:00.000000000-0800--2010-12-31...
   2013-01-01    2011-12-31T16:00:00.000000000-0800--2012-12-31...
   dtype: object

Each column in each partition lives in a separate compressed file::

   $ ls -a data.castra/2011-12-31T16:00:00.000000000-0800--2012-12-31T16:00:00.000000000-0800
   .  ..  .index  price  volume
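
Because the partition list is keyed by each partition's largest index value, a
range query only needs to open the partitions that can overlap the requested
range.  The following is a minimal sketch of that lookup using only the
standard library.  It mirrors the ``select_partitions`` helper in
``castra/core.py`` but substitutes ``bisect`` for the pandas ``searchsorted``
call; the function and argument names are illustrative and not part of
Castra's API.

.. code-block:: python

   from bisect import bisect_left


   def overlapping_partitions(upper_bounds, names, start=None, stop=None):
       """Return names of partitions that may hold values in [start, stop].

       ``upper_bounds`` is the sorted list of each partition's largest index
       value and ``names`` the matching partition names (hypothetical
       stand-ins for the ``c.partitions`` series).
       """
       # First partition whose largest value is >= start
       first = 0 if start is None else bisect_left(upper_bounds, start)
       # Last partition needed: the first one whose largest value is >= stop
       last = len(names) - 1 if stop is None else bisect_left(upper_bounds, stop)
       return names[first:last + 1]


   bounds = [0, 10, 20, 30, 40]
   names = ['a', 'b', 'c', 'd', 'e']
   print(overlapping_partitions(bounds, names, 3, 25))  # ['b', 'c', 'd']

Only the selected partitions are read from disk; the first and last of them
are then trimmed to the exact range.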

Restrictions
------------

Castra is both fast and restrictive.

*  You must always give it dataframes that match its template (same column
   names, index type, dtypes).
*  You can only give castra dataframes with **increasing index values**.  For
   example, you can give it one dataframe a day for values on that day.  You
   cannot go back and update previous days.
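
The second restriction can be pictured as a simple check at write time.  This
is a rough sketch in plain Python rather than Castra's own code:
``AppendOnlyLog`` and its attributes are hypothetical names, and Castra itself
additionally renumbers trivial ``0..n`` indexes instead of rejecting them.

.. code-block:: python

   class AppendOnlyLog(object):
       """Hypothetical stand-in for a castra holding (index, value) batches."""

       def __init__(self):
           self.batches = []
           self.last_index = None  # largest index value stored so far

       def extend(self, batch):
           if not batch:
               return
           batch = sorted(batch)  # order the batch by index value
           first = batch[0][0]
           # A new batch must start strictly after everything already stored
           if self.last_index is not None and first <= self.last_index:
               raise ValueError("Index of new dataframe less than known data")
           self.batches.append(batch)
           self.last_index = batch[-1][0]


   log = AppendOnlyLog()
   log.extend([(2010, 10.0), (2011, 11.0)])
   log.extend([(2012, 12.0), (2013, 13.0)])  # accepted: starts after 2011
   try:
       log.extend([(2009, 9.0)])             # rejected: 2009 <= 2013
   except ValueError as e:
       print(e)

Rejecting out-of-order writes is what lets each partition be written once and
never rewritten.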

Text and Categoricals
---------------------

Castra tries to encode text and object dtype columns with
msgpack_, using the implementation found in
the Pandas library.  It falls back to ``pickle`` with a high protocol if that
fails.

Alternatively, Castra can categorize your data as it receives it

.. code-block:: python

   >>> c = Castra('data.castra', template=df, categories=['list', 'of', 'columns'])

   >>> # or, to categorize all object dtype columns
   >>> c = Castra('data.castra', template=df, categories=True)

Categorizing columns that have repetitive text, like ``'sex'`` or
``'ticker-symbol'`` can greatly improve both read times and computational
performance with Pandas.  See this blogpost_ for more information.
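
To make the idea concrete, here is a small sketch of what categorizing does to
a column: each distinct string is stored once, and the column itself becomes
small integer codes.  It is written in plain Python for illustration only.
The helper name ``encode`` is hypothetical, and Castra itself relies on
``pandas.Categorical`` for this step.

.. code-block:: python

   def encode(values, categories):
       """Return integer codes for ``values``.

       Values not yet present are appended to ``categories`` so that codes
       handed out earlier stay valid.
       """
       lookup = dict((cat, i) for i, cat in enumerate(categories))
       codes = []
       for v in values:
           if v not in lookup:
               lookup[v] = len(categories)
               categories.append(v)
           codes.append(lookup[v])
       return codes


   categories = ['A', 'B']
   codes = encode(['C', 'B', 'B'], categories)
   print(codes)       # [2, 1, 1]
   print(categories)  # ['A', 'B', 'C']

New categories are only ever appended, so codes stored in earlier partitions
remain valid as more data arrives.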

.. _msgpack: http://msgpack.org/index.html


Dask dataframe
--------------

Castra interoperates smoothly with dask.dataframe_:

.. code-block:: python

   >>> import dask.dataframe as dd
   >>> df = dd.read_csv('myfiles.*.csv')
   >>> df.set_index('timestamp', compute=False).to_castra('myfile.castra', categories=True)

   >>> df = dd.from_castra('myfile.castra')

Work in Progress
----------------

Castra is immature and largely for experimental use.

The developers do not promise backwards compatibility with future versions.
You should treat castra as a very efficient temporary format and archive your
data with some other system.


.. _Blosc: https://github.com/Blosc

.. _dask.dataframe: https://dask.pydata.org/en/latest/dataframe.html

.. _blogpost: http://matthewrocklin.com/blog/work/2015/06/18/Categoricals

.. |Build Status| image:: https://travis-ci.org/blaze/castra.svg
   :target: https://travis-ci.org/blaze/castra


================================================
FILE: castra/__init__.py
================================================
from .core import Castra

__version__ = '0.1.8'


================================================
FILE: castra/core.py
================================================
try:
    from collections.abc import Iterator
except ImportError:  # Python 2
    from collections import Iterator

import os

from os.path import exists, isdir

try:
    import cPickle as pickle
except ImportError:
    import pickle

import shutil
import tempfile
from hashlib import md5

from functools import partial

import blosc
import bloscpack

import numpy as np
import pandas as pd

from pandas import msgpack


bp_args = bloscpack.BloscpackArgs(offsets=False, checksum='None')

def blosc_args(dt):
    if np.issubdtype(dt, int):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, np.datetime64):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, float):
        return bloscpack.BloscArgs(dt.itemsize, clevel=1, shuffle=False)
    return None


# http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python
import string
valid_chars = "-_%s%s" % (string.ascii_letters, string.digits)

def escape(text):
    """

    >>> escape("Hello!")  # Remove punctuation from names
    'Hello'

    >>> escape("/!.")  # completely invalid names produce hash string
    'cb6698330c63e87fc35933a0474238b0'
    """
    result = ''.join(c for c in str(text) if c in valid_chars)
    if not result:
        result = md5(str(text).encode()).hexdigest()
    return result


def mkdir(path):
    if not exists(path):
        os.makedirs(path)


class Castra(object):
    meta_fields = ['columns', 'dtypes', 'index_dtype', 'axis_names']

    def __init__(self, path=None, template=None, categories=None, readonly=False):
        self._readonly = readonly
        # check if we should create a random path
        self._explicitly_given_path = path is not None

        if not self._explicitly_given_path:
            self.path = tempfile.mkdtemp(prefix='castra-')
        else:
            self.path = path

        # either we have a meta directory
        if isdir(self.dirname('meta')):
            if template is not None:
                raise ValueError(
                    "Opening a castra with a template, yet this castra\n"
                    "already exists.  Filename: %s" % self.path)
            self.load_meta()
            self.load_partitions()
            self.load_categories()

        # or we don't, in which case we need a template
        elif template is not None:
            if self._readonly:
                raise ValueError("Can't create new castra in readonly mode")

            if isinstance(categories, (list, tuple)):
                if template.index.name in categories:
                    categories.remove(template.index.name)
                    categories.append('.index')
                self.categories = dict((col, []) for col in categories)
            elif categories is True:
                self.categories = dict((col, [])
                                       for col in template.columns
                                       if template.dtypes[col] == 'object')
                if isinstance(template.index, pd.CategoricalIndex):
                    self.categories['.index'] = []
            else:
                self.categories = dict()

            if self.categories:
                categories = set(self.categories)
                template_categories = set(template.dtypes.index.values)
                if categories.difference(template_categories) - set(['.index']):
                    raise ValueError('passed in categories %s are not all '
                                     'contained in template dataframe columns '
                                     '%s' % (categories, template_categories))

            template2 = _decategorize(self.categories, template)[2]

            self.columns, self.dtypes, self.index_dtype = \
                list(template2.columns), template2.dtypes, template2.index.dtype
            self.axis_names = [template2.index.name, template2.columns.name]

            # If index is a RangeIndex, use Int64Index instead
            ind_type = type(template2.index)
            try:
                if isinstance(template2.index, pd.RangeIndex):
                    ind_type = pd.Int64Index
            except AttributeError:
                pass
            self.partitions = pd.Series([], dtype='O', index=ind_type([]))
            self.minimum = None

            # check if the given path exists already and create it if it doesn't
            mkdir(self.path)

            # raise an Exception if it isn't a directory
            if not isdir(self.path):
                raise ValueError("'path': %s must be a directory" % self.path)

            mkdir(self.dirname('meta', 'categories'))
            self.flush_meta()
            self.save_partitions()
        else:
            raise ValueError(
                "must specify a 'template' when creating a new Castra")

    def _empty_dataframe(self):
        data = dict((n, pd.Series([], dtype=d, name=n))
                    for (n, d) in self.dtypes.iteritems())
        index = pd.Index([], name=self.axis_names[0])
        columns = pd.Index(self.columns, name=self.axis_names[1])
        df = pd.DataFrame(data, columns=columns, index=index)
        return _categorize(self.categories, df)

    def load_meta(self, loads=pickle.loads):
        for name in self.meta_fields:
            with open(self.dirname('meta', name), 'rb') as f:
                setattr(self, name, loads(f.read()))

    def flush_meta(self, dumps=partial(pickle.dumps, protocol=2)):
        if self._readonly:
            raise IOError('File not open for writing')
        for name in self.meta_fields:
            with open(self.dirname('meta', name), 'wb') as f:
                f.write(dumps(getattr(self, name)))

    def load_partitions(self, loads=pickle.loads):
        with open(self.dirname('meta', 'plist'), 'rb') as f:
            self.partitions = loads(f.read())
        with open(self.dirname('meta', 'minimum'), 'rb') as f:
            self.minimum = loads(f.read())

    def save_partitions(self, dumps=partial(pickle.dumps, protocol=2)):
        if self._readonly:
            raise IOError('File not open for writing')
        with open(self.dirname('meta', 'minimum'), 'wb') as f:
            f.write(dumps(self.minimum))
        with open(self.dirname('meta', 'plist'), 'wb') as f:
            f.write(dumps(self.partitions))

    def append_categories(self, new, dumps=partial(pickle.dumps, protocol=2)):
        if self._readonly:
            raise IOError('File not open for writing')
        separator = b'-sep-'
        for col, cat in new.items():
            if cat:
                with open(self.dirname('meta', 'categories', col), 'ab') as f:
                    f.write(separator.join(map(dumps, cat)))
                    f.write(separator)

    def load_categories(self, loads=pickle.loads):
        separator = b'-sep-'
        self.categories = dict()
        for col in list(self.columns) + ['.index']:
            fn = self.dirname('meta', 'categories', col)
            if os.path.exists(fn):
                with open(fn, 'rb') as f:
                    text = f.read()
                self.categories[col] = [loads(x)
                                        for x in text.split(separator)[:-1]]

    def extend(self, df):
        if self._readonly:
            raise IOError('File not open for writing')
        if len(df) == 0:
            return
        # TODO: Ensure that df is consistent with existing data
        if not df.index.is_monotonic_increasing:
            df = df.sort_index(inplace=False)

        new_categories, self.categories, df = _decategorize(self.categories,
                                                            df)
        self.append_categories(new_categories)

        if len(self.partitions) and df.index[0] <= self.partitions.index[-1]:
            if is_trivial_index(df.index):
                df = df.copy()
                start = self.partitions.index[-1] + 1
                new_index = pd.Index(np.arange(start, start + len(df)),
                                     name=df.index.name)
                df.index = new_index
            else:
                raise ValueError("Index of new dataframe less than known data")

        index = df.index.values
        partition_name = '--'.join([escape(index.min()), escape(index.max())])

        mkdir(self.dirname(partition_name))

        # Store columns
        for col in df.columns:
            pack_file(df[col].values, self.dirname(partition_name, col))

        # Store index
        fn = self.dirname(partition_name, '.index')
        bloscpack.pack_ndarray_file(index, fn, bloscpack_args=bp_args,
                                    blosc_args=blosc_args(index.dtype))

        if not len(self.partitions):
            self.minimum = coerce_index(index.dtype, index.min())
        self.partitions.loc[index.max()] = partition_name
        self.flush()

    def extend_sequence(self, seq, freq=None):
        """Add dataframes from an iterable, optionally repartitioning by freq.

        Parameters
        ----------
        seq : iterable
            An iterable of dataframes
        freq : frequency, optional
            A pandas datetime offset. If provided, the dataframes will be
            partitioned by this frequency.
        """
        if self._readonly:
            raise IOError('File not open for writing')
        if isinstance(freq, str):
            freq = pd.datetools.to_offset(freq)
            partitioner = lambda buf, df: partitionby_freq(freq, buf, df)
        elif freq is None:
            partitioner = partitionby_none
        else:
            raise ValueError("Invalid 'freq': {0}".format(repr(freq)))
        buf = self._empty_dataframe()
        for df in seq:
            write, buf = partitioner(buf, df)
            for frame in write:
                self.extend(frame)
        if buf is not None and not buf.empty:
            self.extend(buf)

    def dirname(self, *args):
        return os.path.join(self.path, *list(map(escape, args)))

    def load_partition(self, name, columns, categorize=True):
        if isinstance(columns, Iterator):
            columns = list(columns)
        if '.index' in self.categories and name in self.partitions.index:
            name = self.categories['.index'].index(name) - 1
        if not isinstance(columns, list):
            df = self.load_partition(name, [columns], categorize=categorize)
            return df.iloc[:, 0]
        arrays = [unpack_file(self.dirname(name, col)) for col in columns]

        df = pd.DataFrame(dict(zip(columns, arrays)),
                          columns=pd.Index(columns, name=self.axis_names[1],
                                           tupleize_cols=False),
                          index=self.load_index(name))
        if categorize:
            df = _categorize(self.categories, df)
        return df

    def load_index(self, name):
        return pd.Index(unpack_file(self.dirname(name, '.index')),
                        dtype=self.index_dtype,
                        name=self.axis_names[0],
                        tupleize_cols=False)

    def __getitem__(self, key):
        if isinstance(key, tuple):
            key, columns = key
        else:
            columns = self.columns
        if isinstance(columns, slice):
            columns = self.columns[columns]

        if isinstance(key, slice):
            start, stop = key.start, key.stop
        else:
            start, stop = key, key

        if '.index' in self.categories:
            if start is not None:
                start = self.categories['.index'].index(start)
            if stop is not None:
                stop = self.categories['.index'].index(stop)
        key = slice(start, stop)

        names = select_partitions(self.partitions, key)

        if not names:
            return self._empty_dataframe()[columns]

        data_frames = [self.load_partition(name, columns, categorize=False)
                       for name in names]

        data_frames[0] = data_frames[0].loc[start:]
        data_frames[-1] = data_frames[-1].loc[:stop]
        df = pd.concat(data_frames)
        df = _categorize(self.categories, df)
        return df

    def drop(self):
        if self._readonly:
            raise IOError('File not open for writing')
        if os.path.exists(self.path):
            shutil.rmtree(self.path)

    def flush(self):
        if self._readonly:
            raise IOError('File not open for writing')
        self.save_partitions()

    def __enter__(self):
        return self

    def __exit__(self, *args):
        if not self._explicitly_given_path:
            self.drop()
        elif not self._readonly:
            self.flush()

    __del__ = __exit__

    def __getstate__(self):
        if not self._readonly:
            self.flush()
        return (self.path, self._explicitly_given_path, self._readonly)

    def __setstate__(self, state):
        self.path = state[0]
        self._explicitly_given_path = state[1]
        self._readonly = state[2]
        self.load_meta()
        self.load_partitions()
        self.load_categories()

    def to_dask(self, columns=None):
        import dask.dataframe as dd

        meta = self._empty_dataframe()
        if columns is None:
            columns = self.columns
        else:
            meta = meta[columns]

        token = md5(str((self.path, os.path.getmtime(self.path))).encode()).hexdigest()
        name = 'from-castra-' + token

        divisions = [self.minimum] + self.partitions.index.tolist()
        if '.index' in self.categories:
            divisions = ([self.categories['.index'][0]]
                       + [self.categories['.index'][d + 1] for d in divisions[1:-1]]
                       + [self.categories['.index'][-1]])

        key_parts = list(enumerate(self.partitions.values))

        dsk = dict(((name, i), (Castra.load_partition, self, part, columns))
                   for i, part in key_parts)
        if isinstance(columns, list):
            return dd.DataFrame(dsk, name, meta, divisions)
        else:
            return dd.Series(dsk, name, meta, divisions)


def pack_file(x, fn, encoding='utf8'):
    """ Pack numpy array into filename

    Supports binary data with bloscpack and text data with msgpack+blosc

    >>> pack_file(np.array([1, 2, 3]), 'foo.blp')  # doctest: +SKIP

    See also:
        unpack_file
    """
    if x.dtype != 'O':
        bloscpack.pack_ndarray_file(x, fn, bloscpack_args=bp_args,
                blosc_args=blosc_args(x.dtype))
    else:
        bytes = blosc.compress(msgpack.packb(x.tolist(), encoding=encoding), 1)
        with open(fn, 'wb') as f:
            f.write(bytes)


def unpack_file(fn, encoding='utf8'):
    """ Unpack numpy array from filename

    Supports binary data with bloscpack and text data with msgpack+blosc

    >>> unpack_file('foo.blp')  # doctest: +SKIP
    array([1, 2, 3])

    See also:
        pack_file
    """
    try:
        return bloscpack.unpack_ndarray_file(fn)
    except ValueError:
        with open(fn, 'rb') as f:
            data = msgpack.unpackb(blosc.decompress(f.read()),
                                   encoding=encoding)
            return np.array(data, object, copy=False)


def coerce_index(dt, o):
    if np.issubdtype(dt, np.datetime64):
        return pd.Timestamp(o)
    return o


def select_partitions(partitions, key):
    """ Select partitions from partition list given slice

    >>> p = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 10, 20, 30, 40])
    >>> select_partitions(p, slice(3, 25))
    ['b', 'c', 'd']
    """
    assert key.step is None, 'step must be None but was %s' % key.step
    start, stop = key.start, key.stop
    if start is not None:
        start = coerce_index(partitions.index.dtype, start)
        istart = partitions.index.searchsorted(start)
    else:
        istart = 0
    if stop is not None:
        stop = coerce_index(partitions.index.dtype, stop)
        istop = partitions.index.searchsorted(stop)
    else:
        istop = len(partitions) - 1

    names = partitions.iloc[istart: istop + 1].values.tolist()
    return names


def _decategorize(categories, df):
    """ Strip object dtypes from dataframe, update categories

    Given a DataFrame

    >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['C', 'B', 'B']})

    And a dict of known categories

    >>> _ = categories = {'y': ['A', 'B']}

    Return the newly seen categories, the updated dict, and the encoded dataframe

    >>> extra, categories, df = _decategorize(categories, df)
    >>> extra
    {'y': ['C']}
    >>> categories
    {'y': ['A', 'B', 'C']}
    >>> df
       x  y
    0  1  2
    1  2  1
    2  3  1
    """
    extra = dict()
    new_categories = dict()
    new_columns = dict((col, df[col].values) for col in df.columns)
    for col, cat in categories.items():
        if col == '.index' or col not in df.columns:
            continue
        idx = pd.Index(df[col])
        idx = getattr(idx, 'categories', idx)
        ex = idx[~idx.isin(cat)].unique()
        if any(pd.isnull(c) for c in cat):
            ex = ex[~pd.isnull(ex)]
        extra[col] = ex.tolist()
        new_categories[col] = cat + extra[col]
        new_columns[col] = pd.Categorical(df[col].values, new_categories[col]).codes

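    if '.index' in categories:
        # Look up the index categories explicitly rather than relying on
        # whatever ``cat`` was left bound to by the loop above
        cat = categories['.index']
        idx = df.index
        idx = getattr(idx, 'categories', idx)
        ex = idx[~idx.isin(cat)].unique()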
        if any(pd.isnull(c) for c in cat):
            ex = ex[~pd.isnull(ex)]
        extra['.index'] = ex.tolist()
        new_categories['.index'] = cat + extra['.index']

        new_index = pd.Categorical(df.index, new_categories['.index']).codes
        new_index = pd.Index(new_index, name=df.index.name)
    else:
        new_index = df.index

    new_df = pd.DataFrame(new_columns, columns=df.columns, index=new_index)
    return extra, new_categories, new_df


def make_categorical(s, categories):
    name = '.index' if isinstance(s, pd.Index) else s.name
    if name in categories:
        idx = pd.Index(categories[name], tupleize_cols=False, dtype='object')
        idx.is_unique = True
        cat = pd.Categorical(s.values, categories=idx, fastpath=True, ordered=False)
        return pd.CategoricalIndex(cat, name=s.name, ordered=True) if name == '.index' else cat
    return s if name == '.index' else s.values


def _categorize(categories, df):
    """ Categorize columns in dataframe

    >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 2, 0]})
    >>> categories = {'y': ['A', 'B', 'c']}
    >>> _categorize(categories, df)
       x  y
    0  1  A
    1  2  c
    2  3  A
    """
    if isinstance(df, pd.Series):
        return pd.Series(make_categorical(df, categories),
                         index=make_categorical(df.index, categories),
                         name=df.name)
    else:
        return pd.DataFrame(dict((col, make_categorical(df[col], categories))
                                 for col in df.columns),
                            columns=df.columns,
                            index=make_categorical(df.index, categories))


def partitionby_none(buf, new):
    """Repartition to ensure partitions don't split duplicate indices"""
    if new.empty:
        return [], buf
    elif buf.empty:
        return [], new
    if not new.index.is_monotonic_increasing:
        new = new.sort_index(inplace=False)
    end = buf.index[-1]
    if end >= new.index[0] and not is_trivial_index(new.index):
        i = new.index.searchsorted(end, side='right')
        # Only need to concat, `castra.extend` will resort if needed
        buf = pd.concat([buf, new.iloc[:i]])
        new = new.iloc[i:]
    return [buf], new


def partitionby_freq(freq, buf, new):
    """Partition frames into blocks by a freq"""
    df = pd.concat([buf, new])
    if not df.index.is_monotonic_increasing:
        df = df.sort_index(inplace=False)
    start, end = pd.tseries.resample._get_range_edges(df.index[0],
                                                      df.index[-1], freq)
    inds = [df.index.searchsorted(i) for i in
            pd.date_range(start, end, freq=freq)[1:]]
    slices = [(inds[i-1], inds[i]) if i else (0, inds[i]) for i in
              range(len(inds))]
    frames = [df.iloc[i:j] for (i, j) in slices]
    return frames[:-1], frames[-1]


def is_trivial_index(ind):
    """ Is this index just 0..n ?

    If so then we can probably ignore or change it around as necessary

    >>> is_trivial_index(pd.Index([0, 1, 2]))
    True

    >>> is_trivial_index(pd.Index([0, 3, 5]))
    False
    """
    return ind[0] == 0 and (ind == np.arange(len(ind))).all()


================================================
FILE: castra/tests/test_core.py
================================================
import os
import tempfile
import pickle
import shutil

import pandas as pd
import pandas.util.testing as tm

import pytest

import numpy as np

from castra import Castra
from castra.core import mkdir, select_partitions, _decategorize, _categorize


A = pd.DataFrame({'x': [1, 2],
                  'y': [1., 2.]},
                 columns=['x', 'y'],
                 index=[1, 2])

B = pd.DataFrame({'x': [10, 20],
                  'y': [10., 20.]},
                 columns=['x', 'y'],
                 index=[10, 20])


C = pd.DataFrame({'x': [10, 20],
                  'y': [10., 20.],
                  'z': [0, 1]},
                 columns=['x', 'y', 'z']).set_index('z')
C.columns.name = 'cols'


@pytest.yield_fixture
def base():
    d = tempfile.mkdtemp(prefix='castra-')
    try:
        yield d
    finally:
        shutil.rmtree(d)


def test_safe_mkdir_with_new(base):
    path = os.path.join(base, 'db')
    mkdir(path)
    assert os.path.exists(path)
    assert os.path.isdir(path)


def test_safe_mkdir_with_existing(base):
    # an existing path should not raise an exception
    mkdir(base)


def test_create_with_random_directory():
    Castra(template=A)


def test_create_with_non_existing_path(base):
    path = os.path.join(base, 'db')
    Castra(path=path, template=A)


def test_create_with_existing_path(base):
    Castra(path=base, template=A)


def test_get_empty(base):
    df = Castra(path=base, template=A)[:]
    assert (df.columns == A.columns).all()


def test_get_empty_result(base):
    c = Castra(path=base, template=A)
    c.extend(A)

    df = c[100:200]

    assert (df.columns == A.columns).all()


def test_get_slice(base):
    c = Castra(path=base, template=A)
    c.extend(A)

    tm.assert_frame_equal(c[:], c[:, :])
    tm.assert_frame_equal(c[:, 1:], c[:][['y']])


def test_exception_with_non_dir(base):
    file_ = os.path.join(base, 'file')
    with open(file_, 'w') as f:
        f.write('file')
    with pytest.raises(ValueError):
        Castra(file_)


def test_exception_with_existing_castra_and_template(base):
    with Castra(path=base, template=A) as c:
        c.extend(A)
    with pytest.raises(ValueError):
        Castra(path=base, template=A)


def test_exception_with_empty_dir_and_no_template(base):
    with pytest.raises(ValueError):
        Castra(path=base)


def test_load(base):
    with Castra(path=base, template=A) as c:
        c.extend(A)
        c.extend(B)

    loaded = Castra(path=base)
    tm.assert_frame_equal(pd.concat([A, B]), loaded[:])


def test_del_with_random_dir():
    c = Castra(template=A)
    assert os.path.exists(c.path)
    c.__del__()
    assert not os.path.exists(c.path)


def test_context_manager_with_random_dir():
    with Castra(template=A) as c:
        assert os.path.exists(c.path)
    assert not os.path.exists(c.path)


def test_context_manager_with_specific_dir(base):
    with Castra(path=base, template=A) as c:
        assert os.path.exists(c.path)
    assert os.path.exists(c.path)


def test_timeseries():
    indices = [pd.date_range(start=str(i), end=str(i+1), freq='W')
               for i in range(2000, 2015)]
    dfs = [pd.DataFrame({'x': list(range(len(ind)))}, ind).iloc[:-1]
           for ind in indices]

    with Castra(template=dfs[0]) as c:
        for df in dfs:
            c.extend(df)
        df = c['2010-05': '2013-02']
        assert len(df) > 100


def test_Castra():
    c = Castra(template=A)
    c.extend(A)
    c.extend(B)

    assert c.columns == ['x', 'y']

    tm.assert_frame_equal(c[0:100], pd.concat([A, B]))
    tm.assert_frame_equal(c[:5], A)
    tm.assert_frame_equal(c[5:], B)

    tm.assert_frame_equal(c[2:5], A[1:])
    tm.assert_frame_equal(c[2:15], pd.concat([A[1:], B[:1]]))


def test_pickle_Castra():
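    path = tempfile.mkdtemp(prefix='castra-')
    try:
        c = Castra(path=path, template=A)
        c.extend(A)
        c.extend(B)

        dumped = pickle.dumps(c)
        undumped = pickle.loads(dumped)

        tm.assert_frame_equal(pd.concat([A, B]), undumped[:])
    finally:
        # a Castra given an explicit path does not remove it, so clean up here
        shutil.rmtree(path)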


def test_text():
    df = pd.DataFrame({'name': ['Alice', 'Bob'],
                       'balance': [100, 200]}, columns=['name', 'balance'])
    with Castra(template=df) as c:
        c.extend(df)

        tm.assert_frame_equal(c[:], df)


def test_column_access():
    with Castra(template=A) as c:
        c.extend(A)
        c.extend(B)
        df = c[:, ['x']]

        tm.assert_frame_equal(df, pd.concat([A[['x']], B[['x']]]))

        df = c[:, 'x']
        tm.assert_series_equal(df, pd.concat([A.x, B.x]))


def test_reload():
    path = tempfile.mkdtemp(prefix='castra-')
    try:
        c = Castra(template=A, path=path)
        c.extend(A)

        d = Castra(path=path)

        assert c.columns == d.columns
        assert (c.partitions == d.partitions).all()
        assert c.minimum == d.minimum
    finally:
        shutil.rmtree(path)


def test_readonly():
    path = tempfile.mkdtemp(prefix='castra-')
    try:
        c = Castra(path=path, template=A)
        c.extend(A)
        d = Castra(path=path, readonly=True)
        with pytest.raises(IOError):
            d.extend(B)
        with pytest.raises(IOError):
            d.extend_sequence([B])
        with pytest.raises(IOError):
            d.flush()
        with pytest.raises(IOError):
            d.drop()
        with pytest.raises(IOError):
            d.save_partitions()
        with pytest.raises(IOError):
            d.flush_meta()
        assert c.columns == d.columns
        assert (c.partitions == d.partitions).all()
        assert c.minimum == d.minimum
    finally:
        shutil.rmtree(path)


def test_index_dtype_matches_template():
    with Castra(template=A) as c:
        assert c.partitions.index.dtype == A.index.dtype


def test_to_dask_dataframe():
    # importorskip returns the module, so no separate guarded import is needed
    dd = pytest.importorskip('dask.dataframe')

    with Castra(template=A) as c:
        c.extend(A)
        c.extend(B)

        df = c.to_dask()
        assert isinstance(df, dd.DataFrame)
        assert list(df.divisions) == [1, 2, 20]
        tm.assert_frame_equal(df.compute(), c[:])

        df = c.to_dask('x')
        assert isinstance(df, dd.Series)
        assert list(df.divisions) == [1, 2, 20]
        tm.assert_series_equal(df.compute(), c[:, 'x'])


def test_categorize():
    A = pd.DataFrame({'x': [1, 2, 3], 'y': ['A', None, 'A']},
                     columns=['x', 'y'], index=[0, 10, 20])
    B = pd.DataFrame({'x': [4, 5, 6], 'y': ['C', None, 'A']},
                     columns=['x', 'y'], index=[30, 40, 50])

    with Castra(template=A, categories=['y']) as c:
        c.extend(A)
        assert c[:].dtypes['y'] == 'category'
        assert c[:]['y'].cat.codes.dtype == np.dtype('i1')
        assert list(c[:, 'y'].cat.categories) == ['A', None]

        c.extend(B)
        assert list(c[:, 'y'].cat.categories) == ['A', None, 'C']

        assert c.load_partition(c.partitions.iloc[0], 'y').dtype == 'category'

        c.flush()

        d = Castra(path=c.path)
        tm.assert_frame_equal(c[:], d[:])


def test_save_axis_names():
    with Castra(template=C) as c:
        c.extend(C)
        assert c[:].index.name == 'z'
        assert c[:].columns.name == 'cols'
        tm.assert_frame_equal(c[:], C)


def test_same_categories_when_already_categorized():
    A = pd.DataFrame({'x': [1, 2] * 1000,
                      'y': [1., 2.] * 1000,
                      'z': np.random.choice(list('abc'), size=2000)},
                     columns=list('xyz'))
    A['z'] = A.z.astype('category')
    with Castra(template=A, categories=['z']) as c:
        c.extend(A)
        assert c.categories['z'] == A.z.cat.categories.tolist()


def test_category_dtype():
    A = pd.DataFrame({'x': [1, 2] * 3,
                      'y': [1., 2.] * 3,
                      'z': list('abcabc')},
                     columns=list('xyz'))
    with Castra(template=A, categories=['z']) as c:
        c.extend(A)
        # extending must not categorize the caller's frame in place
        assert A.dtypes['z'] == 'object'


def test_do_not_create_dirs_if_template_fails():
    A = pd.DataFrame({'x': [1, 2] * 3,
                      'y': [1., 2.] * 3,
                      'z': list('abcabc')},
                     columns=list('xyz'))
    with pytest.raises(ValueError):
        Castra(template=A, path='foo', categories=['w'])
    assert not os.path.exists('foo')


def test_sort_on_extend():
    df = pd.DataFrame({'x': [1, 2, 3]}, index=[3, 2, 1])
    expected = pd.DataFrame({'x': [3, 2, 1]}, index=[1, 2, 3])
    with Castra(template=df) as c:
        c.extend(df)
        tm.assert_frame_equal(c[:], expected)


def test_select_partitions():
    p = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 10, 20, 30, 40])
    assert select_partitions(p, slice(3, 25)) == ['b', 'c', 'd']
    assert select_partitions(p, slice(None, 25)) == ['a', 'b', 'c', 'd']
    assert select_partitions(p, slice(3, None)) == ['b', 'c', 'd', 'e']
    assert select_partitions(p, slice(None, None)) == ['a', 'b', 'c', 'd', 'e']
    assert select_partitions(p, slice(10, 30)) == ['b', 'c', 'd']
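

# Illustrative sketch only, not castra's implementation: the expectations in
# test_select_partitions can be reproduced with a binary search, assuming each
# partition is keyed by the largest index value it holds.  The helper name
# ``_sketch_select_partitions`` and its ``ends``/``names`` arguments are
# hypothetical and exist only for this example.
def _sketch_select_partitions(ends, names, key):
    from bisect import bisect_left

    # first partition whose upper bound reaches ``key.start``
    lo = 0 if key.start is None else bisect_left(ends, key.start)
    # first partition whose upper bound reaches ``key.stop``; it is kept
    # because label-based slices include their end point
    hi = len(ends) - 1 if key.stop is None else bisect_left(ends, key.stop)
    return names[lo:hi + 1]


def test_select_partitions_sketch():
    ends, names = [0, 10, 20, 30, 40], list('abcde')
    assert _sketch_select_partitions(ends, names, slice(3, 25)) == ['b', 'c', 'd']
    assert _sketch_select_partitions(ends, names, slice(10, 30)) == ['b', 'c', 'd']
    assert _sketch_select_partitions(ends, names, slice(None, None)) == names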


def test_first_index_is_timestamp():
    pytest.importorskip('dask.dataframe')

    df = pd.DataFrame({'x': [1, 2] * 3,
                       'y': [1., 2.] * 3,
                       'z': list('abcabc')},
                      columns=list('xyz'),
                      index=pd.date_range(start='20120101', periods=6))
    with Castra(template=df) as c:
        c.extend(df)

        assert isinstance(c.minimum, pd.Timestamp)
        assert isinstance(c.to_dask().divisions[0], pd.Timestamp)


def test_minimum_dtype():
    df = tm.makeTimeDataFrame()

    with Castra(template=df) as c:
        c.extend(df)
        assert type(c.minimum) is type(c.partitions.index[0])


def test_many_default_indexes():
    a = pd.DataFrame({'x': [1, 2, 3]})
    b = pd.DataFrame({'x': [4, 5, 6]})
    c = pd.DataFrame({'x': [7, 8, 9]})

    e = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

    with Castra(template=a) as C:
        C.extend(a)
        C.extend(b)
        C.extend(c)

        tm.assert_frame_equal(C[:], e)


def test_raise_error_on_mismatched_index():
    x = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])
    y = pd.DataFrame({'x': [1, 2, 3]}, index=[4, 5, 6])
    z = pd.DataFrame({'x': [4, 5, 6]}, index=[5, 6, 7])

    with Castra(template=x) as c:
        c.extend(x)
        c.extend(y)

        with pytest.raises(ValueError):
            c.extend(z)


def test_raise_error_on_equal_index():
    a = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])
    b = pd.DataFrame({'x': [4, 5, 6]}, index=[3, 4, 5])

    with Castra(template=a) as c:
        c.extend(a)

        with pytest.raises(ValueError):
            c.extend(b)


def test_categories_nan():
    a = pd.DataFrame({'x': ['A', np.nan]})
    b = pd.DataFrame({'x': ['B', np.nan]})

    with Castra(template=a, categories=['x']) as c:
        c.extend(a)
        c.extend(b)
        assert len(c.categories['x']) == 3


def test_extend_sequence_freq():
    df = tm.makeTimeDataFrame(1000, 'min')
    seq = [df.iloc[i:i+100] for i in range(0, 1000, 100)]
    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='h')
        tm.assert_frame_equal(c[:], df)
        parts = pd.date_range(start=df.index[59], freq='h',
                              periods=16).insert(17, df.index[-1])
        tm.assert_index_equal(c.partitions.index, parts)

    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='d')
        tm.assert_frame_equal(c[:], df)
        assert len(c.partitions) == 1


def test_extend_sequence_none():
    data = {'a': range(5), 'b': range(5)}
    p1 = pd.DataFrame(data, index=[1, 2, 3, 4, 5])
    p2 = pd.DataFrame(data, index=[5, 5, 5, 6, 7])
    p3 = pd.DataFrame(data, index=[7, 9, 10, 11, 12])
    seq = [p1, p2, p3]
    df = pd.concat(seq)
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)
        assert len(c.partitions) == 3
        assert len(c.load_partition('1--5', ['a', 'b']).index) == 8
        assert len(c.load_partition('6--7', ['a', 'b']).index) == 3
        assert len(c.load_partition('9--12', ['a', 'b']).index) == 4


def test_extend_sequence_overlap():
    df = tm.makeTimeDataFrame(20, 'min')
    p1 = df.iloc[:15]
    p2 = df.iloc[10:20]
    seq = [p1, p2]
    df = pd.concat(seq)
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df.sort_index())
        assert (c.partitions.index == [p.index[-1] for p in seq]).all()
    # Check with trivial index
    p1 = pd.DataFrame({'a': range(10), 'b': range(10)})
    p2 = pd.DataFrame({'a': range(10, 17), 'b': range(10, 17)})
    seq = [p1, p2]
    df = pd.DataFrame({'a': range(17), 'b': range(17)})
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)
        assert (c.partitions.index == [9, 16]).all()


def test_extend_sequence_single_frame():
    df = tm.makeTimeDataFrame(100, 'h')
    seq = [df]
    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='d')
        assert (c.partitions.index == ['2000-01-01 23:00:00', '2000-01-02 23:00:00',
                 '2000-01-03 23:00:00', '2000-01-04 23:00:00', '2000-01-05 03:00:00']).all()
    df = pd.DataFrame({'a': range(10), 'b': range(10)})
    seq = [df]
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)


def test_column_with_period():
    df = pd.DataFrame({'x': [10, 20],
                       '.': [10., 20.]},
                      columns=['x', '.'],
                      index=[10, 20])

    with Castra(template=df) as c:
        c.extend(df)


def test_empty():
    with Castra(template=A) as c:
        c.extend(pd.DataFrame(columns=A.columns))
        assert len(c[:]) == 0


def test_index_with_single_value():
    df = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 1, 2])
    with Castra(template=df) as c:
        c.extend(df)

        tm.assert_frame_equal(c[1], df.loc[1])


def test_categorical_index():
    df = pd.DataFrame({'x': [1, 2, 3]},
            index=pd.CategoricalIndex(['a', 'a', 'b'], ordered=True, name='foo'))

    with Castra(template=df, categories=True) as c:
        c.extend(df)
        result = c[:]
        tm.assert_frame_equal(result, df)

    A = pd.DataFrame({'x': [1, 2, 3]},
                    index=pd.Index(['a', 'a', 'b'], name='foo'))
    B = pd.DataFrame({'x': [4, 5, 6]},
                    index=pd.Index(['c', 'd', 'd'], name='foo'))

    path = tempfile.mkdtemp(prefix='castra-')
    try:
        with Castra(path=path, template=A, categories=['foo']) as c:
            c.extend(A)
            c.extend(B)

            c2 = Castra(path=path)
            result = c2[:]

            expected = pd.concat([A, B])
            expected.index = pd.CategoricalIndex(expected.index,
                    name=expected.index.name, ordered=True)
            tm.assert_frame_equal(result, expected)

            tm.assert_frame_equal(c['a'], expected.loc['a'])
    finally:
        shutil.rmtree(path)


def test_categorical_index_with_dask_dataframe():
    pytest.importorskip('dask.dataframe')
    import dask.dataframe as dd
    import dask

    A = pd.DataFrame({'x': [1, 2, 3, 4]},
                    index=pd.Index(['a', 'a', 'b', 'b'], name='foo'))
    B = pd.DataFrame({'x': [4, 5, 6]},
                    index=pd.Index(['c', 'd', 'd'], name='foo'))

    path = tempfile.mkdtemp(prefix='castra-')
    try:
        with Castra(path=path, template=A, categories=['foo']) as c:
            c.extend(A)
            c.extend(B)

            df = dd.from_castra(path)
            assert df.divisions == ('a', 'c', 'd')

            result = df.compute(get=dask.async.get_sync)

            expected = pd.concat([A, B])
            expected.index = pd.CategoricalIndex(expected.index,
                    name=expected.index.name, ordered=True)

            tm.assert_frame_equal(result, expected)

            tm.assert_frame_equal(df.loc['a'].compute(), expected.loc['a'])
            tm.assert_frame_equal(df.loc['b'].compute(get=dask.async.get_sync),
                                  expected.loc['b'])
    finally:
        shutil.rmtree(path)


def test__decategorize():
    df = pd.DataFrame({'x': [1, 2, 3]},
                      index=pd.CategoricalIndex(['a', 'a', 'b'], ordered=True,
                          name='foo'))

    extra, categories, df2 = _decategorize({'.index': []}, df)

    assert (df2.index == [0, 0, 1]).all()

    df3 = _categorize(categories, df2)

    tm.assert_frame_equal(df, df3)


================================================
FILE: requirements.txt
================================================
numpy
pandas
bloscpack>=0.8.0
blosc


================================================
FILE: setup.py
================================================
#!/usr/bin/env python

from os.path import exists
from setuptools import setup

setup(name='castra',
      version='0.1.8',
      description='On-disk partitioned store',
      url='http://github.com/blaze/Castra/',
      maintainer='Matthew Rocklin',
      maintainer_email='mrocklin@gmail.com',
      license='BSD',
      keywords='',
      packages=['castra'],
      package_data={'castra': ['tests/*.py']},
      install_requires=open('requirements.txt').read().strip().split('\n'),
      long_description=(open('README.rst').read() if exists('README.rst')
                        else ''),
      zip_safe=False)