Repository: blaze/castra
Branch: master
Commit: 1ae53dfcafdd
Files: 10
Total size: 44.4 KB

Directory structure:
gitextract_9pxf5njk/

├── .gitignore
├── .travis.yml
├── LICENSE
├── MANIFEST.in
├── README.rst
├── castra/
│   ├── __init__.py
│   ├── core.py
│   └── tests/
│       └── test_core.py
├── requirements.txt
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.pyc


================================================
FILE: .travis.yml
================================================
sudo: False

language: python

matrix:
  include:
    - python: 2.7
    - python: 3.3
    - python: 3.4
    - python: 3.5

install:
  # Install conda
  - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - conda config --set always_yes yes --set changeps1 no
  - conda update conda

  # Install dependencies
  - conda create -n castra python=$TRAVIS_PYTHON_VERSION pytest numpy pandas dask
  - source activate castra
  - pip install blosc
  - pip install bloscpack
  - pip install dask --upgrade

script:
  - py.test -x --doctest-modules --pyargs castra

notifications:
  email: false


================================================
FILE: LICENSE
================================================
Copyright (c) 2015, Continuum Analytics, Inc.
Copyright (c) 2015, Valentin Haenel <valentin@haenel.co>
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

Neither the name of Continuum Analytics nor the names of any contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.


================================================
FILE: MANIFEST.in
================================================
recursive-include castra *.py
recursive-include docs *.rst

include setup.py
include README.rst
include LICENSE
include requirements.txt
include MANIFEST.in

prune docs/_build


================================================
FILE: README.rst
================================================
Castra
======

|Build Status|

Castra is an on-disk, partitioned, compressed, column store.
Castra provides efficient columnar range queries.

*  **Efficient on-disk:**  Castra stores data on your hard drive in a layout that loads quickly, making inconveniently large data more comfortable to work with.
*  **Partitioned:**  Castra partitions your data along an index, allowing rapid loads of ranges of data like "All records between January and March".
*  **Compressed:**  Castra uses Blosc_ to compress data, increasing effective disk bandwidth and decreasing storage costs.
*  **Column-store:**  Castra stores columns separately, drastically reducing I/O costs for analytic queries.
*  **Tabular data:**  Castra plays well with Pandas and is an ideal fit for append-only applications like time-series.

Maintenance
-----------

This project is no longer actively maintained.  Use at your own risk.

Example
-------

Consider some Pandas DataFrames

.. code-block:: python

   In [1]: import pandas as pd
   In [2]: A = pd.DataFrame({'price': [10.0, 11.0], 'volume': [100, 200]},
      ...:                  index=pd.DatetimeIndex(['2010', '2011']))

   In [3]: B = pd.DataFrame({'price': [12.0, 13.0], 'volume': [300, 400]},
      ...:                  index=pd.DatetimeIndex(['2012', '2013']))

We create a Castra with a filename and a template dataframe from which to get
column name, index, and dtype information

.. code-block:: python

   In [4]: from castra import Castra
   In [5]: c = Castra('data.castra', template=A)

The castra starts empty but we can extend it with new dataframes:

.. code-block:: python

   In [6]: c.extend(A)

   In [7]: c[:]
   Out[7]:
               price  volume
   2010-01-01     10     100
   2011-01-01     11     200

   In [8]: c.extend(B)

   In [9]: c[:]
   Out[9]:
               price  volume
   2010-01-01     10     100
   2011-01-01     11     200
   2012-01-01     12     300
   2013-01-01     13     400

We can select particular columns

.. code-block:: python

   In [10]: c[:, 'price']
   Out[10]:
   2010-01-01    10
   2011-01-01    11
   2012-01-01    12
   2013-01-01    13
   Name: price, dtype: float64

Particular ranges

.. code-block:: python

   In [12]: c['2011':'2013']
   Out[12]:
               price  volume
   2011-01-01     11     200
   2012-01-01     12     300
   2013-01-01     13     400

Or both

.. code-block:: python

   In [13]: c['2011':'2013', 'volume']
   Out[13]:
   2011-01-01    200
   2012-01-01    300
   2013-01-01    400
   Name: volume, dtype: int64

Storage
-------

Castra stores your dataframes as they arrived.  You can see the divisions along
which your data is divided.

.. code-block:: python

   In [14]: c.partitions
   Out[14]:
   2011-01-01    2009-12-31T16:00:00.000000000-0800--2010-12-31...
   2013-01-01    2011-12-31T16:00:00.000000000-0800--2012-12-31...
   dtype: object

Each column in each partition lives in a separate compressed file::

   $ ls -a data.castra/2011-12-31T16:00:00.000000000-0800--2012-12-31T16:00:00.000000000-0800
   .  ..  .index  price  volume
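
Because the partition list is keyed by the largest index value in each partition, a range query only needs to open the partitions that can overlap the request. The following is a minimal sketch of that lookup, assuming a sorted list of partition maxima. It uses the standard library's ``bisect_left`` where ``castra/core.py`` calls the pandas ``searchsorted`` method; the helper name ``select`` and the sample data are illustrative only.

```python
from bisect import bisect_left


def select(maxima, names, start=None, stop=None):
    """Return the names of partitions that may hold rows in [start, stop].

    maxima[i] is the largest index value stored in partition names[i];
    the list must be sorted in increasing order.
    """
    istart = 0 if start is None else bisect_left(maxima, start)
    istop = len(maxima) - 1 if stop is None else bisect_left(maxima, stop)
    return names[istart:istop + 1]


maxima = [0, 10, 20, 30, 40]
names = ['a', 'b', 'c', 'd', 'e']
print(select(maxima, names, 3, 25))  # ['b', 'c', 'd']
```

Rows outside the requested range are then trimmed from the first and last selected partitions after loading.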

Restrictions
------------

Castra is both fast and restrictive.

*  You must always give it dataframes that match its template (same column
   names, index type, dtypes).
*  You can only give castra dataframes with **increasing index values**.  For
   example, you can give it one dataframe a day for values on that day.  You
   cannot go back and update previous days.
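
The increasing-index restriction can be sketched as a simple check against the last stored value. This is a toy stand-in, not Castra's implementation: the class name ``AppendOnlyLog`` is hypothetical, it tracks plain Python values rather than dataframes, and it omits the special handling that ``Castra.extend`` applies to trivial ``0..n`` indexes.

```python
class AppendOnlyLog:
    """Toy stand-in for the partition list: accepts only increasing batches."""

    def __init__(self):
        self.maxima = []  # largest index value of each accepted batch

    def extend(self, index_values):
        if not index_values:
            return
        values = sorted(index_values)  # unsorted batches are sorted first
        if self.maxima and values[0] <= self.maxima[-1]:
            raise ValueError("Index of new dataframe less than known data")
        self.maxima.append(values[-1])


log = AppendOnlyLog()
log.extend([1, 2])
log.extend([10, 20])
try:
    log.extend([5, 6])  # overlaps data that is already stored
except ValueError as error:
    print(error)
```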

Text and Categoricals
---------------------

Castra tries to encode text and object dtype columns with
msgpack_, using the implementation found in
the Pandas library.  It falls back to ``pickle`` with a high protocol if that
fails.

Alternatively, Castra can categorize your data as it receives it

.. code-block:: python

   >>> c = Castra('data.castra', template=df, categories=['list', 'of', 'columns'])

   # or

   >>> c = Castra('data.castra', template=df, categories=True) # all object dtype columns

Categorizing columns that have repetitive text, like ``'sex'`` or
``'ticker-symbol'`` can greatly improve both read times and computational
performance with Pandas.  See this blogpost_ for more information.
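
Categorizing amounts to replacing each text value with a small integer code and keeping the list of distinct values on the side. Below is a minimal pure-Python sketch of that round trip, assuming plain lists rather than Pandas objects. The function names ``decategorize`` and ``categorize`` are chosen for this illustration and only loosely mirror the private helpers in ``castra/core.py``.

```python
def decategorize(known, values):
    """Encode values as integer codes, growing the category list as needed.

    Returns (extra, categories, codes), where extra holds newly seen values.
    """
    seen = set(known)
    extra = []
    for value in values:
        if value not in seen:
            seen.add(value)
            extra.append(value)
    categories = list(known) + extra
    position = {category: i for i, category in enumerate(categories)}
    codes = [position[value] for value in values]
    return extra, categories, codes


def categorize(categories, codes):
    """Decode integer codes back into their original values."""
    return [categories[code] for code in codes]


extra, categories, codes = decategorize(['A', 'B'], ['C', 'B', 'B'])
print(extra, categories, codes)       # ['C'] ['A', 'B', 'C'] [2, 1, 1]
print(categorize(categories, codes))  # ['C', 'B', 'B']
```

Only the integer codes need to be written per partition; new categories are appended to the stored list, so existing codes stay valid.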

.. _msgpack: http://msgpack.org/index.html


Dask dataframe
--------------

Castra interoperates smoothly with dask.dataframe_

.. code-block:: python

   >>> import dask.dataframe as dd
   >>> df = dd.read_csv('myfiles.*.csv')
   >>> df.set_index('timestamp', compute=False).to_castra('myfile.castra', categories=True)

   >>> df = dd.from_castra('myfile.castra')

Work in Progress
----------------

Castra is immature and largely for experimental use.

The developers do not promise backwards compatibility with future versions.
You should treat castra as a very efficient temporary format and archive your
data with some other system.



.. _Blosc: https://github.com/Blosc

.. _dask.dataframe: https://dask.pydata.org/en/latest/dataframe.html

.. _blogpost: http://matthewrocklin.com/blog/work/2015/06/18/Categoricals

.. |Build Status| image:: https://travis-ci.org/blaze/castra.svg
   :target: https://travis-ci.org/blaze/castra


================================================
FILE: castra/__init__.py
================================================
from .core import Castra

__version__ = '0.1.8'


================================================
FILE: castra/core.py
================================================
try:
    from collections.abc import Iterator
except ImportError:  # Python 2
    from collections import Iterator

import os

from os.path import exists, isdir

try:
    import cPickle as pickle
except ImportError:
    import pickle

import shutil
import tempfile
from hashlib import md5

from functools import partial

import blosc
import bloscpack

import numpy as np
import pandas as pd

from pandas import msgpack


bp_args = bloscpack.BloscpackArgs(offsets=False, checksum='None')

def blosc_args(dt):
    if np.issubdtype(dt, int):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, np.datetime64):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, float):
        return bloscpack.BloscArgs(dt.itemsize, clevel=1, shuffle=False)
    return None


# http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python
import string
valid_chars = "-_%s%s" % (string.ascii_letters, string.digits)

def escape(text):
    """

    >>> escape("Hello!")  # Remove punctuation from names
    'Hello'

    >>> escape("/!.")  # completely invalid names produce hash string
    'cb6698330c63e87fc35933a0474238b0'
    """
    result = ''.join(c for c in str(text) if c in valid_chars)
    if not result:
        result = md5(str(text).encode()).hexdigest()
    return result


def mkdir(path):
    if not exists(path):
        os.makedirs(path)


class Castra(object):
    meta_fields = ['columns', 'dtypes', 'index_dtype', 'axis_names']

    def __init__(self, path=None, template=None, categories=None, readonly=False):
        self._readonly = readonly
        # check if we should create a random path
        self._explicitly_given_path = path is not None

        if not self._explicitly_given_path:
            self.path = tempfile.mkdtemp(prefix='castra-')
        else:
            self.path = path

        # either we have a meta directory
        if isdir(self.dirname('meta')):
            if template is not None:
                raise ValueError(
                    "Opening a castra with a template, yet this castra\n"
                    "already exists.  Filename: %s" % self.path)
            self.load_meta()
            self.load_partitions()
            self.load_categories()

        # or we don't, in which case we need a template
        elif template is not None:
            if self._readonly:
                raise ValueError("Can't create new castra in readonly mode")

            if isinstance(categories, (list, tuple)):
                if template.index.name in categories:
                    categories.remove(template.index.name)
                    categories.append('.index')
                self.categories = dict((col, []) for col in categories)
            elif categories is True:
                self.categories = dict((col, [])
                                       for col in template.columns
                                       if template.dtypes[col] == 'object')
                if isinstance(template.index, pd.CategoricalIndex):
                    self.categories['.index'] = []
            else:
                self.categories = dict()

            if self.categories:
                categories = set(self.categories)
                template_categories = set(template.dtypes.index.values)
                if categories.difference(template_categories) - set(['.index']):
                    raise ValueError('passed in categories %s are not all '
                                     'contained in template dataframe columns '
                                     '%s' % (categories, template_categories))

            template2 = _decategorize(self.categories, template)[2]

            self.columns, self.dtypes, self.index_dtype = \
                list(template2.columns), template2.dtypes, template2.index.dtype
            self.axis_names = [template2.index.name, template2.columns.name]

            # If index is a RangeIndex, use Int64Index instead
            ind_type = type(template2.index)
            try:
                if isinstance(template2.index, pd.RangeIndex):
                    ind_type = pd.Int64Index
            except AttributeError:
                pass
            self.partitions = pd.Series([], dtype='O', index=ind_type([]))
            self.minimum = None

            # check if the given path exists already and create it if it doesn't
            mkdir(self.path)

            # raise an Exception if it isn't a directory
            if not isdir(self.path):
                raise ValueError("'path': %s must be a directory" % self.path)

            mkdir(self.dirname('meta', 'categories'))
            self.flush_meta()
            self.save_partitions()
        else:
            raise ValueError(
                "must specify a 'template' when creating a new Castra")

    def _empty_dataframe(self):
        data = dict((n, pd.Series([], dtype=d, name=n))
                    for (n, d) in self.dtypes.iteritems())
        index = pd.Index([], name=self.axis_names[0])
        columns = pd.Index(self.columns, name=self.axis_names[1])
        df = pd.DataFrame(data, columns=columns, index=index)
        return _categorize(self.categories, df)

    def load_meta(self, loads=pickle.loads):
        for name in self.meta_fields:
            with open(self.dirname('meta', name), 'rb') as f:
                setattr(self, name, loads(f.read()))

    def flush_meta(self, dumps=partial(pickle.dumps, protocol=2)):
        if self._readonly:
            raise IOError('File not open for writing')
        for name in self.meta_fields:
            with open(self.dirname('meta', name), 'wb') as f:
                f.write(dumps(getattr(self, name)))

    def load_partitions(self, loads=pickle.loads):
        with open(self.dirname('meta', 'plist'), 'rb') as f:
            self.partitions = loads(f.read())
        with open(self.dirname('meta', 'minimum'), 'rb') as f:
            self.minimum = loads(f.read())

    def save_partitions(self, dumps=partial(pickle.dumps, protocol=2)):
        if self._readonly:
            raise IOError('File not open for writing')
        with open(self.dirname('meta', 'minimum'), 'wb') as f:
            f.write(dumps(self.minimum))
        with open(self.dirname('meta', 'plist'), 'wb') as f:
            f.write(dumps(self.partitions))

    def append_categories(self, new, dumps=partial(pickle.dumps, protocol=2)):
        if self._readonly:
            raise IOError('File not open for writing')
        separator = b'-sep-'
        for col, cat in new.items():
            if cat:
                with open(self.dirname('meta', 'categories', col), 'ab') as f:
                    f.write(separator.join(map(dumps, cat)))
                    f.write(separator)

    def load_categories(self, loads=pickle.loads):
        separator = b'-sep-'
        self.categories = dict()
        for col in list(self.columns) + ['.index']:
            fn = self.dirname('meta', 'categories', col)
            if os.path.exists(fn):
                with open(fn, 'rb') as f:
                    text = f.read()
                self.categories[col] = [loads(x)
                                        for x in text.split(separator)[:-1]]

    def extend(self, df):
        if self._readonly:
            raise IOError('File not open for writing')
        if len(df) == 0:
            return
        # TODO: Ensure that df is consistent with existing data
        if not df.index.is_monotonic_increasing:
            df = df.sort_index(inplace=False)

        new_categories, self.categories, df = _decategorize(self.categories,
                                                            df)
        self.append_categories(new_categories)

        if len(self.partitions) and df.index[0] <= self.partitions.index[-1]:
            if is_trivial_index(df.index):
                df = df.copy()
                start = self.partitions.index[-1] + 1
                new_index = pd.Index(np.arange(start, start + len(df)),
                                     name = df.index.name)
                df.index = new_index
            else:
                raise ValueError("Index of new dataframe less than known data")

        index = df.index.values
        partition_name = '--'.join([escape(index.min()), escape(index.max())])

        mkdir(self.dirname(partition_name))

        # Store columns
        for col in df.columns:
            pack_file(df[col].values, self.dirname(partition_name, col))

        # Store index
        fn = self.dirname(partition_name, '.index')
        bloscpack.pack_ndarray_file(index, fn, bloscpack_args=bp_args,
                                    blosc_args=blosc_args(index.dtype))

        if not len(self.partitions):
            self.minimum = coerce_index(index.dtype, index.min())
        self.partitions.loc[index.max()] = partition_name
        self.flush()

    def extend_sequence(self, seq, freq=None):
        """Add dataframes from an iterable, optionally repartitioning by freq.

        Parameters
        ----------
        seq : iterable
            An iterable of dataframes
        freq : frequency, optional
            A pandas datetime offset. If provided, the dataframes will be
            partitioned by this frequency.
        """
        if self._readonly:
            raise IOError('File not open for writing')
        if isinstance(freq, str):
            freq = pd.datetools.to_offset(freq)
            partitioner = lambda buf, df: partitionby_freq(freq, buf, df)
        elif freq is None:
            partitioner = partitionby_none
        else:
            raise ValueError("Invalid 'freq': {0}".format(repr(freq)))
        buf = self._empty_dataframe()
        for df in seq:
            write, buf = partitioner(buf, df)
            for frame in write:
                self.extend(frame)
        if buf is not None and not buf.empty:
            self.extend(buf)

    def dirname(self, *args):
        return os.path.join(self.path, *list(map(escape, args)))

    def load_partition(self, name, columns, categorize=True):
        if isinstance(columns, Iterator):
            columns = list(columns)
        if '.index' in self.categories and name in self.partitions.index:
            name = self.categories['.index'].index(name) - 1
        if not isinstance(columns, list):
            df = self.load_partition(name, [columns], categorize=categorize)
            return df.iloc[:, 0]
        arrays = [unpack_file(self.dirname(name, col)) for col in columns]

        df = pd.DataFrame(dict(zip(columns, arrays)),
                          columns=pd.Index(columns, name=self.axis_names[1],
                                           tupleize_cols=False),
                          index=self.load_index(name))
        if categorize:
            df = _categorize(self.categories, df)
        return df

    def load_index(self, name):
        return pd.Index(unpack_file(self.dirname(name, '.index')),
                        dtype=self.index_dtype,
                        name=self.axis_names[0],
                        tupleize_cols=False)

    def __getitem__(self, key):
        if isinstance(key, tuple):
            key, columns = key
        else:
            columns = self.columns
        if isinstance(columns, slice):
            columns = self.columns[columns]

        if isinstance(key, slice):
            start, stop = key.start, key.stop
        else:
            start, stop = key, key

        if '.index' in self.categories:
            if start is not None:
                start = self.categories['.index'].index(start)
            if stop is not None:
                stop = self.categories['.index'].index(stop)
        key = slice(start, stop)

        names = select_partitions(self.partitions, key)

        if not names:
            return self._empty_dataframe()[columns]

        data_frames = [self.load_partition(name, columns, categorize=False)
                       for name in names]

        data_frames[0] = data_frames[0].loc[start:]
        data_frames[-1] = data_frames[-1].loc[:stop]
        df = pd.concat(data_frames)
        df = _categorize(self.categories, df)
        return df

    def drop(self):
        if self._readonly:
            raise IOError('File not open for writing')
        if os.path.exists(self.path):
            shutil.rmtree(self.path)

    def flush(self):
        if self._readonly:
            raise IOError('File not open for writing')
        self.save_partitions()

    def __enter__(self):
        return self

    def __exit__(self, *args):
        if not self._explicitly_given_path:
            self.drop()
        elif not self._readonly:
            self.flush()

    __del__ = __exit__

    def __getstate__(self):
        if not self._readonly:
            self.flush()
        return (self.path, self._explicitly_given_path, self._readonly)

    def __setstate__(self, state):
        self.path = state[0]
        self._explicitly_given_path = state[1]
        self._readonly = state[2]
        self.load_meta()
        self.load_partitions()
        self.load_categories()

    def to_dask(self, columns=None):
        import dask.dataframe as dd

        meta = self._empty_dataframe()
        if columns is None:
            columns = self.columns
        else:
            meta = meta[columns]

        token = md5(str((self.path, os.path.getmtime(self.path))).encode()).hexdigest()
        name = 'from-castra-' + token

        divisions = [self.minimum] + self.partitions.index.tolist()
        if '.index' in self.categories:
            divisions = ([self.categories['.index'][0]]
                       + [self.categories['.index'][d + 1] for d in divisions[1:-1]]
                       + [self.categories['.index'][-1]])

        key_parts = list(enumerate(self.partitions.values))

        dsk = dict(((name, i), (Castra.load_partition, self, part, columns))
                   for i, part in key_parts)
        if isinstance(columns, list):
            return dd.DataFrame(dsk, name, meta, divisions)
        else:
            return dd.Series(dsk, name, meta, divisions)


def pack_file(x, fn, encoding='utf8'):
    """ Pack numpy array into filename

    Supports binary data with bloscpack and text data with msgpack+blosc

    >>> pack_file(np.array([1, 2, 3]), 'foo.blp')  # doctest: +SKIP

    See also:
        unpack_file
    """
    if x.dtype != 'O':
        bloscpack.pack_ndarray_file(x, fn, bloscpack_args=bp_args,
                blosc_args=blosc_args(x.dtype))
    else:
        # Avoid shadowing the built-in ``bytes`` type
        payload = blosc.compress(msgpack.packb(x.tolist(), encoding=encoding), 1)
        with open(fn, 'wb') as f:
            f.write(payload)


def unpack_file(fn, encoding='utf8'):
    """ Unpack numpy array from filename

    Supports binary data with bloscpack and text data with msgpack+blosc

    >>> unpack_file('foo.blp')  # doctest: +SKIP
    array([1, 2, 3])

    See also:
        pack_file
    """
    try:
        return bloscpack.unpack_ndarray_file(fn)
    except ValueError:
        with open(fn, 'rb') as f:
            data = msgpack.unpackb(blosc.decompress(f.read()),
                                   encoding=encoding)
            return np.array(data, object, copy=False)


def coerce_index(dt, o):
    if np.issubdtype(dt, np.datetime64):
        return pd.Timestamp(o)
    return o


def select_partitions(partitions, key):
    """ Select partitions from partition list given slice

    >>> p = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 10, 20, 30, 40])
    >>> select_partitions(p, slice(3, 25))
    ['b', 'c', 'd']
    """
    assert key.step is None, 'step must be None but was %s' % key.step
    start, stop = key.start, key.stop
    if start is not None:
        start = coerce_index(partitions.index.dtype, start)
        istart = partitions.index.searchsorted(start)
    else:
        istart = 0
    if stop is not None:
        stop = coerce_index(partitions.index.dtype, stop)
        istop = partitions.index.searchsorted(stop)
    else:
        istop = len(partitions) - 1

    names = partitions.iloc[istart: istop + 1].values.tolist()
    return names


def _decategorize(categories, df):
    """ Strip object dtypes from dataframe, update categories

    Given a DataFrame

    >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['C', 'B', 'B']})

    And a dict of known categories

    >>> _ = categories = {'y': ['A', 'B']}

    Update dict and dataframe in place

    >>> extra, categories, df = _decategorize(categories, df)
    >>> extra
    {'y': ['C']}
    >>> categories
    {'y': ['A', 'B', 'C']}
    >>> df
       x  y
    0  1  2
    1  2  1
    2  3  1
    """
    extra = dict()
    new_categories = dict()
    new_columns = dict((col, df[col].values) for col in df.columns)
    for col, cat in categories.items():
        if col == '.index' or col not in df.columns:
            continue
        idx = pd.Index(df[col])
        idx = getattr(idx, 'categories', idx)
        ex = idx[~idx.isin(cat)].unique()
        if any(pd.isnull(c) for c in cat):
            ex = ex[~pd.isnull(ex)]
        extra[col] = ex.tolist()
        new_categories[col] = cat + extra[col]
        new_columns[col] = pd.Categorical(df[col].values, new_categories[col]).codes

    if '.index' in categories:
        # Use the index's own categories, not the leftover loop variable
        cat = categories['.index']
        idx = df.index
        idx = getattr(idx, 'categories', idx)
        ex = idx[~idx.isin(cat)].unique()
        if any(pd.isnull(c) for c in cat):
            ex = ex[~pd.isnull(ex)]
        extra['.index'] = ex.tolist()
        new_categories['.index'] = cat + extra['.index']

        new_index = pd.Categorical(df.index, new_categories['.index']).codes
        new_index = pd.Index(new_index, name=df.index.name)
    else:
        new_index = df.index

    new_df = pd.DataFrame(new_columns, columns=df.columns, index=new_index)
    return extra, new_categories, new_df


def make_categorical(s, categories):
    name = '.index' if isinstance(s, pd.Index) else s.name
    if name in categories:
        idx = pd.Index(categories[name], tupleize_cols=False, dtype='object')
        idx.is_unique = True
        cat = pd.Categorical(s.values, categories=idx, fastpath=True, ordered=False)
        return pd.CategoricalIndex(cat, name=s.name, ordered=True) if name == '.index' else cat
    return s if name == '.index' else s.values



def _categorize(categories, df):
    """ Categorize columns in dataframe

    >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 2, 0]})
    >>> categories = {'y': ['A', 'B', 'c']}
    >>> _categorize(categories, df)
       x  y
    0  1  A
    1  2  c
    2  3  A
    """
    if isinstance(df, pd.Series):
        return pd.Series(make_categorical(df, categories),
                         index=make_categorical(df.index, categories),
                         name=df.name)
    else:
        return pd.DataFrame(dict((col, make_categorical(df[col], categories))
                                 for col in df.columns),
                            columns=df.columns,
                            index=make_categorical(df.index, categories))


def partitionby_none(buf, new):
    """Repartition to ensure partitions don't split duplicate indices"""
    if new.empty:
        return [], buf
    elif buf.empty:
        return [], new
    if not new.index.is_monotonic_increasing:
        new = new.sort_index(inplace=False)
    end = buf.index[-1]
    if end >= new.index[0] and not is_trivial_index(new.index):
        i = new.index.searchsorted(end, side='right')
        # Only need to concat, `castra.extend` will resort if needed
        buf = pd.concat([buf, new.iloc[:i]])
        new = new.iloc[i:]
    return [buf], new


def partitionby_freq(freq, buf, new):
    """Partition frames into blocks by a freq"""
    df = pd.concat([buf, new])
    if not df.index.is_monotonic_increasing:
        df = df.sort_index(inplace=False)
    start, end = pd.tseries.resample._get_range_edges(df.index[0],
                                                      df.index[-1], freq)
    inds = [df.index.searchsorted(i) for i in
            pd.date_range(start, end, freq=freq)[1:]]
    slices = [(inds[i-1], inds[i]) if i else (0, inds[i]) for i in
              range(len(inds))]
    frames = [df.iloc[i:j] for (i, j) in slices]
    return frames[:-1], frames[-1]


def is_trivial_index(ind):
    """ Is this index just 0..n ?

    If so then we can probably ignore or change it around as necessary

    >>> is_trivial_index(pd.Index([0, 1, 2]))
    True

    >>> is_trivial_index(pd.Index([0, 3, 5]))
    False
    """
    return ind[0] == 0 and (ind == np.arange(len(ind))).all()


================================================
FILE: castra/tests/test_core.py
================================================
import os
import tempfile
import pickle
import shutil

import pandas as pd
import pandas.util.testing as tm

import pytest

import numpy as np

from castra import Castra
from castra.core import mkdir, select_partitions, _decategorize, _categorize


A = pd.DataFrame({'x': [1, 2],
                  'y': [1., 2.]},
                 columns=['x', 'y'],
                 index=[1, 2])

B = pd.DataFrame({'x': [10, 20],
                  'y': [10., 20.]},
                 columns=['x', 'y'],
                 index=[10, 20])


C = pd.DataFrame({'x': [10, 20],
                  'y': [10., 20.],
                  'z': [0, 1]},
                 columns=['x', 'y', 'z']).set_index('z')
C.columns.name = 'cols'


@pytest.yield_fixture
def base():
    d = tempfile.mkdtemp(prefix='castra-')
    try:
        yield d
    finally:
        shutil.rmtree(d)


def test_safe_mkdir_with_new(base):
    path = os.path.join(base, 'db')
    mkdir(path)
    assert os.path.exists(path)
    assert os.path.isdir(path)


def test_safe_mkdir_with_existing(base):
    # an existing path should not raise an exception
    mkdir(base)


def test_create_with_random_directory():
    Castra(template=A)


def test_create_with_non_existing_path(base):
    path = os.path.join(base, 'db')
    Castra(path=path, template=A)


def test_create_with_existing_path(base):
    Castra(path=base, template=A)


def test_get_empty(base):
    df = Castra(path=base, template=A)[:]
    assert (df.columns == A.columns).all()


def test_get_empty_result(base):
    c = Castra(path=base, template=A)
    c.extend(A)

    df = c[100:200]

    assert (df.columns == A.columns).all()


def test_get_slice(base):
    c = Castra(path=base, template=A)
    c.extend(A)

    tm.assert_frame_equal(c[:], c[:, :])
    tm.assert_frame_equal(c[:, 1:], c[:][['y']])


def test_exception_with_non_dir(base):
    file_ = os.path.join(base, 'file')
    with open(file_, 'w') as f:
        f.write('file')
    with pytest.raises(ValueError):
        Castra(file_)


def test_exception_with_existing_castra_and_template(base):
    with Castra(path=base, template=A) as c:
        c.extend(A)
    with pytest.raises(ValueError):
        Castra(path=base, template=A)


def test_exception_with_empty_dir_and_no_template(base):
    with pytest.raises(ValueError):
        Castra(path=base)


def test_load(base):
    with Castra(path=base, template=A) as c:
        c.extend(A)
        c.extend(B)

    loaded = Castra(path=base)
    tm.assert_frame_equal(pd.concat([A, B]), loaded[:])


def test_del_with_random_dir():
    c = Castra(template=A)
    assert os.path.exists(c.path)
    c.__del__()
    assert not os.path.exists(c.path)


def test_context_manager_with_random_dir():
    with Castra(template=A) as c:
        assert os.path.exists(c.path)
    assert not os.path.exists(c.path)


def test_context_manager_with_specific_dir(base):
    with Castra(path=base, template=A) as c:
        assert os.path.exists(c.path)
    assert os.path.exists(c.path)


def test_timeseries():
    indices = [pd.date_range(start=str(i), end=str(i+1), freq='W')
               for i in range(2000, 2015)]
    dfs = [pd.DataFrame({'x': list(range(len(ind)))}, ind).iloc[:-1]
           for ind in indices]

    with Castra(template=dfs[0]) as c:
        for df in dfs:
            c.extend(df)
        df = c['2010-05': '2013-02']
        assert len(df) > 100


def test_Castra():
    c = Castra(template=A)
    c.extend(A)
    c.extend(B)

    assert c.columns == ['x', 'y']

    tm.assert_frame_equal(c[0:100], pd.concat([A, B]))
    tm.assert_frame_equal(c[:5], A)
    tm.assert_frame_equal(c[5:], B)

    tm.assert_frame_equal(c[2:5], A[1:])
    tm.assert_frame_equal(c[2:15], pd.concat([A[1:], B[:1]]))


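def test_pickle_Castra():
    path = tempfile.mkdtemp(prefix='castra-')
    try:
        c = Castra(path=path, template=A)
        c.extend(A)
        c.extend(B)

        # Round-trip through pickle; the copy should read the same data.
        dumped = pickle.dumps(c)
        undumped = pickle.loads(dumped)

        tm.assert_frame_equal(pd.concat([A, B]), undumped[:])
    finally:
        # Remove the temporary directory, as test_reload does.
        shutil.rmtree(path)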


def test_text():
    df = pd.DataFrame({'name': ['Alice', 'Bob'],
                       'balance': [100, 200]}, columns=['name', 'balance'])
    with Castra(template=df) as c:
        c.extend(df)

        tm.assert_frame_equal(c[:], df)


def test_column_access():
    with Castra(template=A) as c:
        c.extend(A)
        c.extend(B)
        df = c[:, ['x']]

        tm.assert_frame_equal(df, pd.concat([A[['x']], B[['x']]]))

        df = c[:, 'x']
        tm.assert_series_equal(df, pd.concat([A.x, B.x]))


def test_reload():
    path = tempfile.mkdtemp(prefix='castra-')
    try:
        c = Castra(template=A, path=path)
        c.extend(A)

        d = Castra(path=path)

        assert c.columns == d.columns
        assert (c.partitions == d.partitions).all()
        assert c.minimum == d.minimum
    finally:
        shutil.rmtree(path)


def test_readonly():
    path = tempfile.mkdtemp(prefix='castra-')
    try:
        c = Castra(path=path, template=A)
        c.extend(A)
        d = Castra(path=path, readonly=True)
        with pytest.raises(IOError):
            d.extend(B)
        with pytest.raises(IOError):
            d.extend_sequence([B])
        with pytest.raises(IOError):
            d.flush()
        with pytest.raises(IOError):
            d.drop()
        with pytest.raises(IOError):
            d.save_partitions()
        with pytest.raises(IOError):
            d.flush_meta()
        assert c.columns == d.columns
        assert (c.partitions == d.partitions).all()
        assert c.minimum == d.minimum
    finally:
        shutil.rmtree(path)


def test_index_dtype_matches_template():
    with Castra(template=A) as c:
        assert c.partitions.index.dtype == A.index.dtype


def test_to_dask_dataframe():
    # Skip when dask is unavailable; importorskip returns the imported module.
    dd = pytest.importorskip('dask.dataframe')

    with Castra(template=A) as c:
        c.extend(A)
        c.extend(B)

        df = c.to_dask()
        assert isinstance(df, dd.DataFrame)
        assert list(df.divisions) == [1, 2, 20]
        tm.assert_frame_equal(df.compute(), c[:])

        df = c.to_dask('x')
        assert isinstance(df, dd.Series)
        assert list(df.divisions) == [1, 2, 20]
        tm.assert_series_equal(df.compute(), c[:, 'x'])


def test_categorize():
    A = pd.DataFrame({'x': [1, 2, 3], 'y': ['A', None, 'A']},
                     columns=['x', 'y'], index=[0, 10, 20])
    B = pd.DataFrame({'x': [4, 5, 6], 'y': ['C', None, 'A']},
                     columns=['x', 'y'], index=[30, 40, 50])

    with Castra(template=A, categories=['y']) as c:
        c.extend(A)
        assert c[:].dtypes['y'] == 'category'
        assert c[:]['y'].cat.codes.dtype == np.dtype('i1')
        assert list(c[:, 'y'].cat.categories) == ['A', None]

        c.extend(B)
        assert list(c[:, 'y'].cat.categories) == ['A', None, 'C']

        assert c.load_partition(c.partitions.iloc[0], 'y').dtype == 'category'

        c.flush()

        d = Castra(path=c.path)
        tm.assert_frame_equal(c[:], d[:])


def test_save_axis_names():
    with Castra(template=C) as c:
        c.extend(C)
        assert c[:].index.name == 'z'
        assert c[:].columns.name == 'cols'
        tm.assert_frame_equal(c[:], C)


def test_same_categories_when_already_categorized():
    A = pd.DataFrame({'x': [1, 2] * 1000,
                      'y': [1., 2.] * 1000,
                      'z': np.random.choice(list('abc'), size=2000)},
                     columns=list('xyz'))
    A['z'] = A.z.astype('category')
    with Castra(template=A, categories=['z']) as c:
        c.extend(A)
        assert c.categories['z'] == A.z.cat.categories.tolist()


def test_category_dtype():
    A = pd.DataFrame({'x': [1, 2] * 3,
                      'y': [1., 2.] * 3,
                      'z': list('abcabc')},
                     columns=list('xyz'))
    with Castra(template=A, categories=['z']) as c:
        c.extend(A)
        assert A.dtypes['z'] == 'object'


def test_do_not_create_dirs_if_template_fails():
    A = pd.DataFrame({'x': [1, 2] * 3,
                      'y': [1., 2.] * 3,
                      'z': list('abcabc')},
                     columns=list('xyz'))
    with pytest.raises(ValueError):
        Castra(template=A, path='foo', categories=['w'])
    assert not os.path.exists('foo')


def test_sort_on_extend():
    df = pd.DataFrame({'x': [1, 2, 3]}, index=[3, 2, 1])
    expected = pd.DataFrame({'x': [3, 2, 1]}, index=[1, 2, 3])
    with Castra(template=df) as c:
        c.extend(df)
        tm.assert_frame_equal(c[:], expected)


def test_select_partitions():
    p = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 10, 20, 30, 40])
    assert select_partitions(p, slice(3, 25)) == ['b', 'c', 'd']
    assert select_partitions(p, slice(None, 25)) == ['a', 'b', 'c', 'd']
    assert select_partitions(p, slice(3, None)) == ['b', 'c', 'd', 'e']
    assert select_partitions(p, slice(None, None)) == ['a', 'b', 'c', 'd', 'e']
    assert select_partitions(p, slice(10, 30)) == ['b', 'c', 'd']
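

# Illustrative sketch only, not castra's implementation. It assumes each
# partition is keyed by its largest index value (as the partition indexes in
# test_extend_sequence_overlap suggest) and uses plain lists instead of a
# pandas Series. The name ``sketch_select_partitions`` is hypothetical.
from bisect import bisect_left


def sketch_select_partitions(keys, names, start=None, stop=None):
    """Return the partition names that may hold rows in [start, stop].

    ``keys`` is a sorted list of per-partition maximum index values and
    ``names`` is the parallel list of partition names.
    """
    # First partition whose maximum is >= start can contain ``start``.
    lo = 0 if start is None else bisect_left(keys, start)
    # The partition whose maximum is >= stop is the last one needed,
    # so include it (hence the + 1).
    hi = len(keys) if stop is None else bisect_left(keys, stop) + 1
    return names[lo:hi]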


def test_first_index_is_timestamp():
    pytest.importorskip('dask.dataframe')

    df = pd.DataFrame({'x': [1, 2] * 3,
                       'y': [1., 2.] * 3,
                       'z': list('abcabc')},
                      columns=list('xyz'),
                      index=pd.date_range(start='20120101', periods=6))
    with Castra(template=df) as c:
        c.extend(df)

        assert isinstance(c.minimum, pd.Timestamp)
        assert isinstance(c.to_dask().divisions[0], pd.Timestamp)


def test_minimum_dtype():
    df = tm.makeTimeDataFrame()

    with Castra(template=df) as c:
        c.extend(df)
        assert type(c.minimum) is type(c.partitions.index[0])


def test_many_default_indexes():
    a = pd.DataFrame({'x': [1, 2, 3]})
    b = pd.DataFrame({'x': [4, 5, 6]})
    c = pd.DataFrame({'x': [7, 8, 9]})

    e = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

    with Castra(template=a) as C:
        C.extend(a)
        C.extend(b)
        C.extend(c)

        tm.assert_frame_equal(C[:], e)


def test_raise_error_on_mismatched_index():
    x = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])
    y = pd.DataFrame({'x': [1, 2, 3]}, index=[4, 5, 6])
    z = pd.DataFrame({'x': [4, 5, 6]}, index=[5, 6, 7])

    with Castra(template=x) as c:
        c.extend(x)
        c.extend(y)

        with pytest.raises(ValueError):
            c.extend(z)


def test_raise_error_on_equal_index():
    a = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])
    b = pd.DataFrame({'x': [4, 5, 6]}, index=[3, 4, 5])

    with Castra(template=a) as c:
        c.extend(a)

        with pytest.raises(ValueError):
            c.extend(b)


def test_categories_nan():
    a = pd.DataFrame({'x': ['A', np.nan]})
    b = pd.DataFrame({'x': ['B', np.nan]})

    with Castra(template=a, categories=['x']) as c:
        c.extend(a)
        c.extend(b)
        assert len(c.categories['x']) == 3


def test_extend_sequence_freq():
    df = pd.util.testing.makeTimeDataFrame(1000, 'min')
    seq = [df.iloc[i:i+100] for i in range(0, 1000, 100)]
    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='h')
        tm.assert_frame_equal(c[:], df)
        parts = pd.date_range(start=df.index[59], freq='h',
                              periods=16).insert(17, df.index[-1])
        tm.assert_index_equal(c.partitions.index, parts)

    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='d')
        tm.assert_frame_equal(c[:], df)
        assert len(c.partitions) == 1


def test_extend_sequence_none():
    data = {'a': range(5), 'b': range(5)}
    p1 = pd.DataFrame(data, index=[1, 2, 3, 4, 5])
    p2 = pd.DataFrame(data, index=[5, 5, 5, 6, 7])
    p3 = pd.DataFrame(data, index=[7, 9, 10, 11, 12])
    seq = [p1, p2, p3]
    df = pd.concat(seq)
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)
        assert len(c.partitions) == 3
        assert len(c.load_partition('1--5', ['a', 'b']).index) == 8
        assert len(c.load_partition('6--7', ['a', 'b']).index) == 3
        assert len(c.load_partition('9--12', ['a', 'b']).index) == 4
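

# Illustrative sketch only, not castra's partitionby_none. It works on plain
# lists of index values and reproduces the grouping that the assertions in
# test_extend_sequence_none expect: rows at the start of a chunk that repeat
# the previous chunk's final index value are moved into the earlier partition,
# so no index value is split across two partitions.
# The name ``sketch_partition_boundaries`` is hypothetical.
def sketch_partition_boundaries(index_chunks):
    """Regroup sorted index chunks so equal index values stay together."""
    partitions = []
    buf = []
    for chunk in index_chunks:
        chunk = list(chunk)
        if not buf:
            buf = chunk
            continue
        last = buf[-1]
        # Count leading rows of the new chunk that share the buffer's
        # final index value; they belong with the buffered partition.
        n = 0
        while n < len(chunk) and chunk[n] == last:
            n += 1
        buf.extend(chunk[:n])
        partitions.append(buf)
        buf = chunk[n:]
    if buf:
        partitions.append(buf)
    return partitions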


def test_extend_sequence_overlap():
    df = pd.util.testing.makeTimeDataFrame(20, 'min')
    p1 = df.iloc[:15]
    p2 = df.iloc[10:20]
    seq = [p1, p2]
    df = pd.concat(seq)
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df.sort_index())
        assert (c.partitions.index == [p.index[-1] for p in seq]).all()
    # Check with trivial index
    p1 = pd.DataFrame({'a': range(10), 'b': range(10)})
    p2 = pd.DataFrame({'a': range(10, 17), 'b': range(10, 17)})
    seq = [p1, p2]
    df = pd.DataFrame({'a': range(17), 'b': range(17)})
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)
        assert (c.partitions.index == [9, 16]).all()


def test_extend_sequence_single_frame():
    df = pd.util.testing.makeTimeDataFrame(100, 'h')
    seq = [df]
    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='d')
        expected = ['2000-01-01 23:00:00', '2000-01-02 23:00:00',
                    '2000-01-03 23:00:00', '2000-01-04 23:00:00',
                    '2000-01-05 03:00:00']
        assert (c.partitions.index == expected).all()
    df = pd.DataFrame({'a': range(10), 'b': range(10)})
    seq = [df]
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)


def test_column_with_period():
    df = pd.DataFrame({'x': [10, 20],
                       '.': [10., 20.]},
                      columns=['x', '.'],
                      index=[10, 20])

    with Castra(template=df) as c:
        c.extend(df)


def test_empty():
    with Castra(template=A) as c:
        c.extend(pd.DataFrame(columns=A.columns))
        assert len(c[:]) == 0


def test_index_with_single_value():
    df = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 1, 2])
    with Castra(template=df) as c:
        c.extend(df)

        tm.assert_frame_equal(c[1], df.loc[1])


def test_categorical_index():
    df = pd.DataFrame({'x': [1, 2, 3]},
            index=pd.CategoricalIndex(['a', 'a', 'b'], ordered=True, name='foo'))

    with Castra(template=df, categories=True) as c:
        c.extend(df)
        result = c[:]
        tm.assert_frame_equal(result, df)

    A = pd.DataFrame({'x': [1, 2, 3]},
                    index=pd.Index(['a', 'a', 'b'], name='foo'))
    B = pd.DataFrame({'x': [4, 5, 6]},
                    index=pd.Index(['c', 'd', 'd'], name='foo'))

    path = tempfile.mkdtemp(prefix='castra-')
    try:
        with Castra(path=path, template=A, categories=['foo']) as c:
            c.extend(A)
            c.extend(B)

            c2 = Castra(path=path)
            result = c2[:]

            expected = pd.concat([A, B])
            expected.index = pd.CategoricalIndex(expected.index,
                    name=expected.index.name, ordered=True)
            tm.assert_frame_equal(result, expected)

            tm.assert_frame_equal(c['a'], expected.loc['a'])
    finally:
        shutil.rmtree(path)


def test_categorical_index_with_dask_dataframe():
    pytest.importorskip('dask.dataframe')
    import dask.dataframe as dd
    import dask

    A = pd.DataFrame({'x': [1, 2, 3, 4]},
                    index=pd.Index(['a', 'a', 'b', 'b'], name='foo'))
    B = pd.DataFrame({'x': [4, 5, 6]},
                    index=pd.Index(['c', 'd', 'd'], name='foo'))


    path = tempfile.mkdtemp(prefix='castra-')
    try:
        with Castra(path=path, template=A, categories=['foo']) as c:
            c.extend(A)
            c.extend(B)

            df = dd.from_castra(path)
            assert df.divisions == ('a', 'c', 'd')

            result = df.compute(get=dask.async.get_sync)

            expected = pd.concat([A, B])
            expected.index = pd.CategoricalIndex(expected.index,
                    name=expected.index.name, ordered=True)

            tm.assert_frame_equal(result, expected)

            tm.assert_frame_equal(df.loc['a'].compute(), expected.loc['a'])
            tm.assert_frame_equal(df.loc['b'].compute(get=dask.async.get_sync),
                                  expected.loc['b'])
    finally:
        shutil.rmtree(path)


def test__decategorize():
    df = pd.DataFrame({'x': [1, 2, 3]},
                      index=pd.CategoricalIndex(['a', 'a', 'b'], ordered=True,
                          name='foo'))

    extra, categories, df2 = _decategorize({'.index': []}, df)

    assert (df2.index == [0, 0, 1]).all()

    df3 = _categorize(categories, df2)

    tm.assert_frame_equal(df, df3)


================================================
FILE: requirements.txt
================================================
numpy
pandas
bloscpack>=0.8.0
blosc


================================================
FILE: setup.py
================================================
#!/usr/bin/env python

from os.path import exists
from setuptools import setup

setup(name='castra',
      version='0.1.8',
      description='On-disk partitioned store',
      url='http://github.com/blaze/Castra/',
      maintainer='Matthew Rocklin',
      maintainer_email='mrocklin@gmail.com',
      license='BSD',
      keywords='',
      packages=['castra'],
      package_data={'castra': ['tests/*.py']},
      install_requires=list(open('requirements.txt').read().strip().split('\n')),
      long_description=(open('README.rst').read() if exists('README.rst')
                        else ''),
      zip_safe=False)

SYMBOL INDEX (83 symbols across 2 files)

FILE: castra/core.py
  function blosc_args (line 29) | def blosc_args(dt):
  function escape (line 43) | def escape(text):
  function mkdir (line 58) | def mkdir(path):
  class Castra (line 63) | class Castra(object):
    method __init__ (line 66) | def __init__(self, path=None, template=None, categories=None, readonly...
    method _empty_dataframe (line 143) | def _empty_dataframe(self):
    method load_meta (line 151) | def load_meta(self, loads=pickle.loads):
    method flush_meta (line 156) | def flush_meta(self, dumps=partial(pickle.dumps, protocol=2)):
    method load_partitions (line 163) | def load_partitions(self, loads=pickle.loads):
    method save_partitions (line 169) | def save_partitions(self, dumps=partial(pickle.dumps, protocol=2)):
    method append_categories (line 177) | def append_categories(self, new, dumps=partial(pickle.dumps, protocol=...
    method load_categories (line 187) | def load_categories(self, loads=pickle.loads):
    method extend (line 198) | def extend(self, df):
    method extend_sequence (line 240) | def extend_sequence(self, seq, freq=None):
    method dirname (line 268) | def dirname(self, *args):
    method load_partition (line 271) | def load_partition(self, name, columns, categorize=True):
    method load_index (line 289) | def load_index(self, name):
    method __getitem__ (line 295) | def __getitem__(self, key):
    method drop (line 329) | def drop(self):
    method flush (line 335) | def flush(self):
    method __enter__ (line 340) | def __enter__(self):
    method __exit__ (line 343) | def __exit__(self, *args):
    method __getstate__ (line 351) | def __getstate__(self):
    method __setstate__ (line 356) | def __setstate__(self, state):
    method to_dask (line 364) | def to_dask(self, columns=None):
  function pack_file (line 392) | def pack_file(x, fn, encoding='utf8'):
  function unpack_file (line 411) | def unpack_file(fn, encoding='utf8'):
  function coerce_index (line 431) | def coerce_index(dt, o):
  function select_partitions (line 437) | def select_partitions(partitions, key):
  function _decategorize (line 461) | def _decategorize(categories, df):
  function make_categorical (line 518) | def make_categorical(s, categories):
  function _categorize (line 529) | def _categorize(categories, df):
  function partitionby_none (line 551) | def partitionby_none(buf, new):
  function partitionby_freq (line 568) | def partitionby_freq(freq, buf, new):
  function is_trivial_index (line 583) | def is_trivial_index(ind):

FILE: castra/tests/test_core.py
  function base (line 36) | def base():
  function test_safe_mkdir_with_new (line 44) | def test_safe_mkdir_with_new(base):
  function test_safe_mkdir_with_existing (line 51) | def test_safe_mkdir_with_existing(base):
  function test_create_with_random_directory (line 56) | def test_create_with_random_directory():
  function test_create_with_non_existing_path (line 60) | def test_create_with_non_existing_path(base):
  function test_create_with_existing_path (line 65) | def test_create_with_existing_path(base):
  function test_get_empty (line 69) | def test_get_empty(base):
  function test_get_empty_result (line 74) | def test_get_empty_result(base):
  function test_get_slice (line 83) | def test_get_slice(base):
  function test_exception_with_non_dir (line 91) | def test_exception_with_non_dir(base):
  function test_exception_with_existing_castra_and_template (line 99) | def test_exception_with_existing_castra_and_template(base):
  function test_exception_with_empty_dir_and_no_template (line 106) | def test_exception_with_empty_dir_and_no_template(base):
  function test_load (line 111) | def test_load(base):
  function test_del_with_random_dir (line 120) | def test_del_with_random_dir():
  function test_context_manager_with_random_dir (line 127) | def test_context_manager_with_random_dir():
  function test_context_manager_with_specific_dir (line 133) | def test_context_manager_with_specific_dir(base):
  function test_timeseries (line 139) | def test_timeseries():
  function test_Castra (line 152) | def test_Castra():
  function test_pickle_Castra (line 167) | def test_pickle_Castra():
  function test_text (line 179) | def test_text():
  function test_column_access (line 188) | def test_column_access():
  function test_reload (line 200) | def test_reload():
  function test_readonly (line 215) | def test_readonly():
  function test_index_dtype_matches_template (line 240) | def test_index_dtype_matches_template():
  function test_to_dask_dataframe (line 245) | def test_to_dask_dataframe():
  function test_categorize (line 268) | def test_categorize():
  function test_save_axis_names (line 291) | def test_save_axis_names():
  function test_same_categories_when_already_categorized (line 299) | def test_same_categories_when_already_categorized():
  function test_category_dtype (line 310) | def test_category_dtype():
  function test_do_not_create_dirs_if_template_fails (line 320) | def test_do_not_create_dirs_if_template_fails():
  function test_sort_on_extend (line 330) | def test_sort_on_extend():
  function test_select_partitions (line 338) | def test_select_partitions():
  function test_first_index_is_timestamp (line 347) | def test_first_index_is_timestamp():
  function test_minimum_dtype (line 362) | def test_minimum_dtype():
  function test_many_default_indexes (line 370) | def test_many_default_indexes():
  function test_raise_error_on_mismatched_index (line 385) | def test_raise_error_on_mismatched_index():
  function test_raise_error_on_equal_index (line 398) | def test_raise_error_on_equal_index():
  function test_categories_nan (line 409) | def test_categories_nan():
  function test_extend_sequence_freq (line 419) | def test_extend_sequence_freq():
  function test_extend_sequence_none (line 435) | def test_extend_sequence_none():
  function test_extend_sequence_overlap (line 451) | def test_extend_sequence_overlap():
  function test_extend_sequence_single_frame (line 472) | def test_extend_sequence_single_frame():
  function test_column_with_period (line 486) | def test_column_with_period():
  function test_empty (line 496) | def test_empty():
  function test_index_with_single_value (line 502) | def test_index_with_single_value():
  function test_categorical_index (line 510) | def test_categorical_index():
  function test_categorical_index_with_dask_dataframe (line 543) | def test_categorical_index_with_dask_dataframe():
  function test__decategorize (line 578) | def test__decategorize():
