[
  {
    "path": ".gitignore",
    "content": "*.pyc\n"
  },
  {
    "path": ".travis.yml",
    "content": "sudo: False\n\nlanguage: python\n\nmatrix:\n  include:\n    - python: 2.7\n    - python: 3.3\n    - python: 3.4\n    - python: 3.5\n\ninstall:\n  # Install conda\n  - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh\n  - bash miniconda.sh -b -p $HOME/miniconda\n  - export PATH=\"$HOME/miniconda/bin:$PATH\"\n  - conda config --set always_yes yes --set changeps1 no\n  - conda update conda\n\n  # Install dependencies\n  - conda create -n castra python=$TRAVIS_PYTHON_VERSION pytest numpy pandas dask\n  - source activate castra\n  - pip install blosc\n  - pip install bloscpack\n  - pip install dask --upgrade\n\nscript:\n  - py.test -x --doctest-modules --pyargs castra\n\nnotifications:\n  email: false\n"
  },
  {
    "path": "LICENSE",
    "content": "﻿Copyright (c) 2015, Continuum Analytics, Inc.\nCopyright (c) 2015, Valentin Haenel <valentin@haenel.co>\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without modification,\nare permitted provided that the following conditions are met:\n\nRedistributions of source code must retain the above copyright notice,\nthis list of conditions and the following disclaimer.\n\nRedistributions in binary form must reproduce the above copyright notice,\nthis list of conditions and the following disclaimer in the documentation\nand/or other materials provided with the distribution.\n\nNeither the name of Continuum Analytics nor the names of any contributors\nmay be used to endorse or promote products derived from this software\nwithout specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\nAND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\nIMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE\nARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE\nLIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR\nCONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF\nSUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS\nINTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN\nCONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)\nARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF\nTHE POSSIBILITY OF SUCH DAMAGE.\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "recursive-include castra *.py\nrecursive-include docs *.rst\n\ninclude setup.py\ninclude README.rst\ninclude LICENSE.txt\ninclude requirements.txt\ninclude MANIFEST.in\n\nprune docs/_build\n"
  },
  {
    "path": "README.rst",
    "content": "Castra\n======\n\n|Build Status|\n\nCastra is an on-disk, partitioned, compressed, column store.\nCastra provides efficient columnar range queries.\n\n*  **Efficient on-disk:**  Castra stores data on your hard drive in a way that you can load it quickly, increasing the comfort of inconveniently large data.\n*  **Partitioned:**  Castra partitions your data along an index, allowing rapid loads of ranges of data like \"All records between January and March\"\n*  **Compressed:**  Castra uses Blosc_ to compress data, increasing effective disk bandwidth and decreasing storage costs\n*  **Column-store:**  Castra stores columns separately, drastically reducing I/O costs for analytic queries\n*  **Tabular data:**  Castra plays well with Pandas and is an ideal fit for append-only applications like time-series\n\nMaintenance\n-----------\n\nThis project is no longer actively maintained.  Use at your own risk.\n\nExample\n-------\n\nConsider some Pandas DataFrames\n\n.. code-block:: python\n\n   In [1]: import pandas as pd\n   In [2]: A = pd.DataFrame({'price': [10.0, 11.0], 'volume': [100, 200]},\n      ...:                  index=pd.DatetimeIndex(['2010', '2011']))\n\n   In [3]: B = pd.DataFrame({'price': [12.0, 13.0], 'volume': [300, 400]},\n      ...:                  index=pd.DatetimeIndex(['2012', '2013']))\n\nWe create a Castra with a filename and a template dataframe from which to get\ncolumn name, index, and dtype information\n\n.. code-block:: python\n\n   In [4]: from castra import Castra\n   In [5]: c = Castra('data.castra', template=A)\n\nThe castra starts empty but we can extend it with new dataframes:\n\n.. 
code-block:: python\n\n   In [6]: c.extend(A)\n\n   In [7]: c[:]\n   Out[7]:\n               price  volume\n   2010-01-01     10     100\n   2011-01-01     11     200\n\n   In [8]: c.extend(B)\n\n   In [9]: c[:]\n   Out[9]:\n               price  volume\n   2010-01-01     10     100\n   2011-01-01     11     200\n   2012-01-01     12     300\n   2013-01-01     13     400\n\nWe can select particular columns\n\n.. code-block:: python\n\n   In [10]: c[:, 'price']\n   Out[10]:\n   2010-01-01    10\n   2011-01-01    11\n   2012-01-01    12\n   2013-01-01    13\n   Name: price, dtype: float64\n\nParticular ranges\n\n.. code-block:: python\n\n   In [12]: c['2011':'2013']\n   Out[12]:\n               price  volume\n   2011-01-01     11     200\n   2012-01-01     12     300\n   2013-01-01     13     400\n\nOr both\n\n.. code-block:: python\n\n   In [13]: c['2011':'2013', 'volume']\n   Out[13]:\n   2011-01-01    200\n   2012-01-01    300\n   2013-01-01    400\n   Name: volume, dtype: int64\n\nStorage\n-------\n\nCastra stores your dataframes as they arrive.  You can inspect the divisions\nalong which your data is partitioned.\n\n.. code-block:: python\n\n   In [14]: c.partitions\n   Out[14]:\n   2011-01-01    2009-12-31T16:00:00.000000000-0800--2010-12-31...\n   2013-01-01    2011-12-31T16:00:00.000000000-0800--2012-12-31...\n   dtype: object\n\nEach column in each partition lives in a separate compressed file::\n\n   $ ls -a data.castra/2011-12-31T16:00:00.000000000-0800--2012-12-31T16:00:00.000000000-0800\n   .  ..  .index  price  volume\n\nRestrictions\n------------\n\nCastra is both fast and restrictive.\n\n*  You must always give it dataframes that match its template (same column\n   names, index type, dtypes).\n*  You can only give castra dataframes with **increasing index values**.  For\n   example you can give it one dataframe a day for values on that day.  
You can\n   not go back and update previous days.\n\nText and Categoricals\n---------------------\n\nCastra tries to encode text and object dtype columns with\nmsgpack_, using the implementation found in\nthe Pandas library.  It falls back to ``pickle`` with a high protocol if that\nfails.\n\nAlternatively, Castra can categorize your data as it receives it\n\n.. code-block:: python\n\n   >>> c = Castra('data.castra', template=df, categories=['list', 'of', 'columns'])\n\n   >>> # or\n\n   >>> c = Castra('data.castra', template=df, categories=True) # all object dtype columns\n\nCategorizing columns that have repetitive text, like ``'sex'`` or\n``'ticker-symbol'``, can greatly improve both read times and computational\nperformance with Pandas.  See this blogpost_ for more information.\n\n.. _msgpack: http://msgpack.org/index.html\n\n\nDask dataframe\n--------------\n\nCastra interoperates smoothly with dask.dataframe_\n\n.. code-block:: python\n\n   >>> import dask.dataframe as dd\n   >>> df = dd.read_csv('myfiles.*.csv')\n   >>> df.set_index('timestamp', compute=False).to_castra('myfile.castra', categories=True)\n\n   >>> df = dd.from_castra('myfile.castra')\n\nWork in Progress\n----------------\n\nCastra is immature and largely for experimental use.\n\nThe developers do not promise backwards compatibility with future versions.\nYou should treat castra as a very efficient temporary format and archive your\ndata with some other system.\n\n\n\n.. _Blosc: https://github.com/Blosc\n\n.. _dask.dataframe: https://dask.pydata.org/en/latest/dataframe.html\n\n.. _blogpost: http://matthewrocklin.com/blog/work/2015/06/18/Categoricals\n\n.. |Build Status| image:: https://travis-ci.org/blaze/castra.svg\n   :target: https://travis-ci.org/blaze/castra\n"
  },
  {
    "path": "castra/__init__.py",
    "content": "from .core import Castra\n\n__version__ = '0.1.8'\n"
  },
  {
    "path": "castra/core.py",
    "content": "from collections import Iterator\n\nimport os\n\nfrom os.path import exists, isdir\n\ntry:\n    import cPickle as pickle\nexcept ImportError:\n    import pickle\n\nimport shutil\nimport tempfile\nfrom hashlib import md5\n\nfrom functools import partial\n\nimport blosc\nimport bloscpack\n\nimport numpy as np\nimport pandas as pd\n\nfrom pandas import msgpack\n\n\nbp_args = bloscpack.BloscpackArgs(offsets=False, checksum='None')\n\ndef blosc_args(dt):\n    if np.issubdtype(dt, int):\n        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)\n    if np.issubdtype(dt, np.datetime64):\n        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)\n    if np.issubdtype(dt, float):\n        return bloscpack.BloscArgs(dt.itemsize, clevel=1, shuffle=False)\n    return None\n\n\n# http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python\nimport string\nvalid_chars = \"-_%s%s\" % (string.ascii_letters, string.digits)\n\ndef escape(text):\n    \"\"\"\n\n    >>> escape(\"Hello!\")  # Remove punctuation from names\n    'Hello'\n\n    >>> escape(\"/!.\")  # completely invalid names produce hash string\n    'cb6698330c63e87fc35933a0474238b0'\n    \"\"\"\n    result = ''.join(c for c in str(text) if c in valid_chars)\n    if not result:\n        result = md5(str(text).encode()).hexdigest()\n    return result\n\n\ndef mkdir(path):\n    if not exists(path):\n        os.makedirs(path)\n\n\nclass Castra(object):\n    meta_fields = ['columns', 'dtypes', 'index_dtype', 'axis_names']\n\n    def __init__(self, path=None, template=None, categories=None, readonly=False):\n        self._readonly = readonly\n        # check if we should create a random path\n        self._explicitly_given_path = path is not None\n\n        if not self._explicitly_given_path:\n            self.path = tempfile.mkdtemp(prefix='castra-')\n        else:\n            self.path = path\n\n        # either we have a meta directory\n        if 
isdir(self.dirname('meta')):\n            if template is not None:\n                raise ValueError(\n                    \"Opening a castra with a template, yet this castra\\n\"\n                    \"already exists.  Filename: %s\" % self.path)\n            self.load_meta()\n            self.load_partitions()\n            self.load_categories()\n\n        # or we don't, in which case we need a template\n        elif template is not None:\n            if self._readonly:\n                raise ValueError(\"Can't create new castra in readonly mode\")\n\n            if isinstance(categories, (list, tuple)):\n                # copy so that tuples work and the caller's list is not mutated\n                categories = list(categories)\n                if template.index.name in categories:\n                    categories.remove(template.index.name)\n                    categories.append('.index')\n                self.categories = dict((col, []) for col in categories)\n            elif categories is True:\n                self.categories = dict((col, [])\n                                       for col in template.columns\n                                       if template.dtypes[col] == 'object')\n                if isinstance(template.index, pd.CategoricalIndex):\n                    self.categories['.index'] = []\n            else:\n                self.categories = dict()\n\n            if self.categories:\n                categories = set(self.categories)\n                template_categories = set(template.dtypes.index.values)\n                if categories.difference(template_categories) - set(['.index']):\n                    raise ValueError('passed in categories %s are not all '\n                                     'contained in template dataframe columns '\n                                     '%s' % (categories, template_categories))\n\n            template2 = _decategorize(self.categories, template)[2]\n\n            self.columns, self.dtypes, self.index_dtype = \\\n                list(template2.columns), template2.dtypes, template2.index.dtype\n            self.axis_names = 
[template2.index.name, template2.columns.name]\n\n            # If index is a RangeIndex, use Int64Index instead\n            ind_type = type(template2.index)\n            try:\n                if isinstance(template2.index, pd.RangeIndex):\n                    ind_type = pd.Int64Index\n            except AttributeError:\n                pass\n            self.partitions = pd.Series([], dtype='O', index=ind_type([]))\n            self.minimum = None\n\n            # check if the given path exists already and create it if it doesn't\n            mkdir(self.path)\n\n            # raise an Exception if it isn't a directory\n            if not isdir(self.path):\n                raise ValueError(\"'path': %s must be a directory\" % self.path)\n\n            mkdir(self.dirname('meta', 'categories'))\n            self.flush_meta()\n            self.save_partitions()\n        else:\n            raise ValueError(\n                \"must specify a 'template' when creating a new Castra\")\n\n    def _empty_dataframe(self):\n        data = dict((n, pd.Series([], dtype=d, name=n))\n                    for (n, d) in self.dtypes.iteritems())\n        index = pd.Index([], name=self.axis_names[0])\n        columns = pd.Index(self.columns, name=self.axis_names[1])\n        df = pd.DataFrame(data, columns=columns, index=index)\n        return _categorize(self.categories, df)\n\n    def load_meta(self, loads=pickle.loads):\n        for name in self.meta_fields:\n            with open(self.dirname('meta', name), 'rb') as f:\n                setattr(self, name, loads(f.read()))\n\n    def flush_meta(self, dumps=partial(pickle.dumps, protocol=2)):\n        if self._readonly:\n            raise IOError('File not open for writing')\n        for name in self.meta_fields:\n            with open(self.dirname('meta', name), 'wb') as f:\n                f.write(dumps(getattr(self, name)))\n\n    def load_partitions(self, loads=pickle.loads):\n        with open(self.dirname('meta', 'plist'), 'rb') as f:\n   
         self.partitions = loads(f.read())\n        with open(self.dirname('meta', 'minimum'), 'rb') as f:\n            self.minimum = loads(f.read())\n\n    def save_partitions(self, dumps=partial(pickle.dumps, protocol=2)):\n        if self._readonly:\n            raise IOError('File not open for writing')\n        with open(self.dirname('meta', 'minimum'), 'wb') as f:\n            f.write(dumps(self.minimum))\n        with open(self.dirname('meta', 'plist'), 'wb') as f:\n            f.write(dumps(self.partitions))\n\n    def append_categories(self, new, dumps=partial(pickle.dumps, protocol=2)):\n        if self._readonly:\n            raise IOError('File not open for writing')\n        separator = b'-sep-'\n        for col, cat in new.items():\n            if cat:\n                with open(self.dirname('meta', 'categories', col), 'ab') as f:\n                    f.write(separator.join(map(dumps, cat)))\n                    f.write(separator)\n\n    def load_categories(self, loads=pickle.loads):\n        separator = b'-sep-'\n        self.categories = dict()\n        for col in list(self.columns) + ['.index']:\n            fn = self.dirname('meta', 'categories', col)\n            if os.path.exists(fn):\n                with open(fn, 'rb') as f:\n                    text = f.read()\n                self.categories[col] = [loads(x)\n                                        for x in text.split(separator)[:-1]]\n\n    def extend(self, df):\n        if self._readonly:\n            raise IOError('File not open for writing')\n        if len(df) == 0:\n            return\n        # TODO: Ensure that df is consistent with existing data\n        if not df.index.is_monotonic_increasing:\n            df = df.sort_index(inplace=False)\n\n        new_categories, self.categories, df = _decategorize(self.categories,\n                                                            df)\n        self.append_categories(new_categories)\n\n        if len(self.partitions) and df.index[0] 
<= self.partitions.index[-1]:\n            if is_trivial_index(df.index):\n                df = df.copy()\n                start = self.partitions.index[-1] + 1\n                new_index = pd.Index(np.arange(start, start + len(df)),\n                                     name = df.index.name)\n                df.index = new_index\n            else:\n                raise ValueError(\"Index of new dataframe less than known data\")\n\n        index = df.index.values\n        partition_name = '--'.join([escape(index.min()), escape(index.max())])\n\n        mkdir(self.dirname(partition_name))\n\n        # Store columns\n        for col in df.columns:\n            pack_file(df[col].values, self.dirname(partition_name, col))\n\n        # Store index\n        fn = self.dirname(partition_name, '.index')\n        bloscpack.pack_ndarray_file(index, fn, bloscpack_args=bp_args,\n                                    blosc_args=blosc_args(index.dtype))\n\n        if not len(self.partitions):\n            self.minimum = coerce_index(index.dtype, index.min())\n        self.partitions.loc[index.max()] = partition_name\n        self.flush()\n\n    def extend_sequence(self, seq, freq=None):\n        \"\"\"Add dataframes from an iterable, optionally repartitioning by freq.\n\n        Parameters\n        ----------\n        seq : iterable\n            An iterable of dataframes\n        freq : frequency, optional\n            A pandas datetime offset. 
If provided, the dataframes will be\n            partitioned by this frequency.\n        \"\"\"\n        if self._readonly:\n            raise IOError('File not open for writing')\n        if isinstance(freq, str):\n            freq = pd.datetools.to_offset(freq)\n            partitioner = lambda buf, df: partitionby_freq(freq, buf, df)\n        elif freq is None:\n            partitioner = partitionby_none\n        else:\n            raise ValueError(\"Invalid 'freq': {0}\".format(repr(freq)))\n        buf = self._empty_dataframe()\n        for df in seq:\n            write, buf = partitioner(buf, df)\n            for frame in write:\n                self.extend(frame)\n        if buf is not None and not buf.empty:\n            self.extend(buf)\n\n    def dirname(self, *args):\n        return os.path.join(self.path, *list(map(escape, args)))\n\n    def load_partition(self, name, columns, categorize=True):\n        if isinstance(columns, Iterator):\n            columns = list(columns)\n        if '.index' in self.categories and name in self.partitions.index:\n            name = self.categories['.index'].index(name) - 1\n        if not isinstance(columns, list):\n            df = self.load_partition(name, [columns], categorize=categorize)\n            return df.iloc[:, 0]\n        arrays = [unpack_file(self.dirname(name, col)) for col in columns]\n\n        df = pd.DataFrame(dict(zip(columns, arrays)),\n                          columns=pd.Index(columns, name=self.axis_names[1],\n                                           tupleize_cols=False),\n                          index=self.load_index(name))\n        if categorize:\n            df = _categorize(self.categories, df)\n        return df\n\n    def load_index(self, name):\n        return pd.Index(unpack_file(self.dirname(name, '.index')),\n                        dtype=self.index_dtype,\n                        name=self.axis_names[0],\n                        tupleize_cols=False)\n\n    def __getitem__(self, 
key):\n        if isinstance(key, tuple):\n            key, columns = key\n        else:\n            columns = self.columns\n        if isinstance(columns, slice):\n            columns = self.columns[columns]\n\n        if isinstance(key, slice):\n            start, stop = key.start, key.stop\n        else:\n            start, stop = key, key\n\n        if '.index' in self.categories:\n            if start is not None:\n                start = self.categories['.index'].index(start)\n            if stop is not None:\n                stop = self.categories['.index'].index(stop)\n        key = slice(start, stop)\n\n        names = select_partitions(self.partitions, key)\n\n        if not names:\n            return self._empty_dataframe()[columns]\n\n        data_frames = [self.load_partition(name, columns, categorize=False)\n                       for name in names]\n\n        data_frames[0] = data_frames[0].loc[start:]\n        data_frames[-1] = data_frames[-1].loc[:stop]\n        df = pd.concat(data_frames)\n        df = _categorize(self.categories, df)\n        return df\n\n    def drop(self):\n        if self._readonly:\n            raise IOError('File not open for writing')\n        if os.path.exists(self.path):\n            shutil.rmtree(self.path)\n\n    def flush(self):\n        if self._readonly:\n            raise IOError('File not open for writing')\n        self.save_partitions()\n\n    def __enter__(self):\n        return self\n\n    def __exit__(self, *args):\n        if not self._explicitly_given_path:\n            self.drop()\n        elif not self._readonly:\n            self.flush()\n\n    __del__ = __exit__\n\n    def __getstate__(self):\n        if not self._readonly:\n            self.flush()\n        return (self.path, self._explicitly_given_path, self._readonly)\n\n    def __setstate__(self, state):\n        self.path = state[0]\n        self._explicitly_given_path = state[1]\n        self._readonly = state[2]\n        self.load_meta()\n        
self.load_partitions()\n        self.load_categories()\n\n    def to_dask(self, columns=None):\n        import dask.dataframe as dd\n\n        meta = self._empty_dataframe()\n        if columns is None:\n            columns = self.columns\n        else:\n            meta = meta[columns]\n\n        token = md5(str((self.path, os.path.getmtime(self.path))).encode()).hexdigest()\n        name = 'from-castra-' + token\n\n        divisions = [self.minimum] + self.partitions.index.tolist()\n        if '.index' in self.categories:\n            divisions = ([self.categories['.index'][0]]\n                       + [self.categories['.index'][d + 1] for d in divisions[1:-1]]\n                       + [self.categories['.index'][-1]])\n\n        key_parts = list(enumerate(self.partitions.values))\n\n        dsk = dict(((name, i), (Castra.load_partition, self, part, columns))\n                   for i, part in key_parts)\n        if isinstance(columns, list):\n            return dd.DataFrame(dsk, name, meta, divisions)\n        else:\n            return dd.Series(dsk, name, meta, divisions)\n\n\ndef pack_file(x, fn, encoding='utf8'):\n    \"\"\" Pack numpy array into filename\n\n    Supports binary data with bloscpack and text data with msgpack+blosc\n\n    >>> pack_file(np.array([1, 2, 3]), 'foo.blp')  # doctest: +SKIP\n\n    See also:\n        unpack_file\n    \"\"\"\n    if x.dtype != 'O':\n        bloscpack.pack_ndarray_file(x, fn, bloscpack_args=bp_args,\n                blosc_args=blosc_args(x.dtype))\n    else:\n        bytes = blosc.compress(msgpack.packb(x.tolist(), encoding=encoding), 1)\n        with open(fn, 'wb') as f:\n            f.write(bytes)\n\n\ndef unpack_file(fn, encoding='utf8'):\n    \"\"\" Unpack numpy array from filename\n\n    Supports binary data with bloscpack and text data with msgpack+blosc\n\n    >>> unpack_file('foo.blp')  # doctest: +SKIP\n    array([1, 2, 3])\n\n    See also:\n        pack_file\n    \"\"\"\n    try:\n        return 
bloscpack.unpack_ndarray_file(fn)\n    except ValueError:\n        with open(fn, 'rb') as f:\n            data = msgpack.unpackb(blosc.decompress(f.read()),\n                                   encoding=encoding)\n            return np.array(data, object, copy=False)\n\n\ndef coerce_index(dt, o):\n    if np.issubdtype(dt, np.datetime64):\n        return pd.Timestamp(o)\n    return o\n\n\ndef select_partitions(partitions, key):\n    \"\"\" Select partitions from partition list given slice\n\n    >>> p = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 10, 20, 30, 40])\n    >>> select_partitions(p, slice(3, 25))\n    ['b', 'c', 'd']\n    \"\"\"\n    assert key.step is None, 'step must be None but was %s' % key.step\n    start, stop = key.start, key.stop\n    if start is not None:\n        start = coerce_index(partitions.index.dtype, start)\n        istart = partitions.index.searchsorted(start)\n    else:\n        istart = 0\n    if stop is not None:\n        stop = coerce_index(partitions.index.dtype, stop)\n        istop = partitions.index.searchsorted(stop)\n    else:\n        istop = len(partitions) - 1\n\n    names = partitions.iloc[istart: istop + 1].values.tolist()\n    return names\n\n\ndef _decategorize(categories, df):\n    \"\"\" Strip object dtypes from dataframe, update categories\n\n    Given a DataFrame\n\n    >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['C', 'B', 'B']})\n\n    And a dict of known categories\n\n    >>> _ = categories = {'y': ['A', 'B']}\n\n    Update dict and dataframe in place\n\n    >>> extra, categories, df = _decategorize(categories, df)\n    >>> extra\n    {'y': ['C']}\n    >>> categories\n    {'y': ['A', 'B', 'C']}\n    >>> df\n       x  y\n    0  1  2\n    1  2  1\n    2  3  1\n    \"\"\"\n    extra = dict()\n    new_categories = dict()\n    new_columns = dict((col, df[col].values) for col in df.columns)\n    for col, cat in categories.items():\n        if col == '.index' or col not in df.columns:\n            continue\n        idx 
= pd.Index(df[col])\n        idx = getattr(idx, 'categories', idx)\n        ex = idx[~idx.isin(cat)].unique()\n        if any(pd.isnull(c) for c in cat):\n            ex = ex[~pd.isnull(ex)]\n        extra[col] = ex.tolist()\n        new_categories[col] = cat + extra[col]\n        new_columns[col] = pd.Categorical(df[col].values, new_categories[col]).codes\n\n    if '.index' in categories:\n        # look up the index categories explicitly rather than reusing the\n        # loop variable left over from the column loop above\n        cat = categories['.index']\n        idx = df.index\n        idx = getattr(idx, 'categories', idx)\n        ex = idx[~idx.isin(cat)].unique()\n        if any(pd.isnull(c) for c in cat):\n            ex = ex[~pd.isnull(ex)]\n        extra['.index'] = ex.tolist()\n        new_categories['.index'] = cat + extra['.index']\n\n        new_index = pd.Categorical(df.index, new_categories['.index']).codes\n        new_index = pd.Index(new_index, name=df.index.name)\n    else:\n        new_index = df.index\n\n    new_df = pd.DataFrame(new_columns, columns=df.columns, index=new_index)\n    return extra, new_categories, new_df\n\n\ndef make_categorical(s, categories):\n    name = '.index' if isinstance(s, pd.Index) else s.name\n    if name in categories:\n        idx = pd.Index(categories[name], tupleize_cols=False, dtype='object')\n        idx.is_unique = True\n        cat = pd.Categorical(s.values, categories=idx, fastpath=True, ordered=False)\n        return pd.CategoricalIndex(cat, name=s.name, ordered=True) if name == '.index' else cat\n    return s if name == '.index' else s.values\n\n\n\ndef _categorize(categories, df):\n    \"\"\" Categorize columns in dataframe\n\n    >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 2, 0]})\n    >>> categories = {'y': ['A', 'B', 'c']}\n    >>> _categorize(categories, df)\n       x  y\n    0  1  A\n    1  2  c\n    2  3  A\n    \"\"\"\n    if isinstance(df, pd.Series):\n        return pd.Series(make_categorical(df, categories),\n                         index=make_categorical(df.index, categories),\n                         name=df.name)\n    else:\n        return 
pd.DataFrame(dict((col, make_categorical(df[col], categories))\n                                 for col in df.columns),\n                            columns=df.columns,\n                            index=make_categorical(df.index, categories))\n\n\ndef partitionby_none(buf, new):\n    \"\"\"Repartition to ensure partitions don't split duplicate indices\"\"\"\n    if new.empty:\n        return [], buf\n    elif buf.empty:\n        return [], new\n    if not new.index.is_monotonic_increasing:\n        new = new.sort_index(inplace=False)\n    end = buf.index[-1]\n    if end >= new.index[0] and not is_trivial_index(new.index):\n        i = new.index.searchsorted(end, side='right')\n        # Only need to concat, `castra.extend` will resort if needed\n        buf = pd.concat([buf, new.iloc[:i]])\n        new = new.iloc[i:]\n    return [buf], new\n\n\ndef partitionby_freq(freq, buf, new):\n    \"\"\"Partition frames into blocks by a freq\"\"\"\n    df = pd.concat([buf, new])\n    if not df.index.is_monotonic_increasing:\n        df = df.sort_index(inplace=False)\n    start, end = pd.tseries.resample._get_range_edges(df.index[0],\n                                                      df.index[-1], freq)\n    inds = [df.index.searchsorted(i) for i in\n            pd.date_range(start, end, freq=freq)[1:]]\n    slices = [(inds[i-1], inds[i]) if i else (0, inds[i]) for i in\n              range(len(inds))]\n    frames = [df.iloc[i:j] for (i, j) in slices]\n    return frames[:-1], frames[-1]\n\n\ndef is_trivial_index(ind):\n    \"\"\" Is this index just 0..n ?\n\n    If so then we can probably ignore or change it around as necessary\n\n    >>> is_trivial_index(pd.Index([0, 1, 2]))\n    True\n\n    >>> is_trivial_index(pd.Index([0, 3, 5]))\n    False\n    \"\"\"\n    return ind[0] == 0 and (ind == np.arange(len(ind))).all()\n"
  },
  {
    "path": "castra/tests/test_core.py",
    "content": "import os\nimport tempfile\nimport pickle\nimport shutil\n\nimport pandas as pd\nimport pandas.util.testing as tm\n\nimport pytest\n\nimport numpy as np\n\nfrom castra import Castra\nfrom castra.core import mkdir, select_partitions, _decategorize, _categorize\n\n\nA = pd.DataFrame({'x': [1, 2],\n                  'y': [1., 2.]},\n                 columns=['x', 'y'],\n                 index=[1, 2])\n\nB = pd.DataFrame({'x': [10, 20],\n                  'y': [10., 20.]},\n                 columns=['x', 'y'],\n                 index=[10, 20])\n\n\nC = pd.DataFrame({'x': [10, 20],\n                  'y': [10., 20.],\n                  'z': [0, 1]},\n                 columns=['x', 'y', 'z']).set_index('z')\nC.columns.name = 'cols'\n\n\n@pytest.yield_fixture\ndef base():\n    d = tempfile.mkdtemp(prefix='castra-')\n    try:\n        yield d\n    finally:\n        shutil.rmtree(d)\n\n\ndef test_safe_mkdir_with_new(base):\n    path = os.path.join(base, 'db')\n    mkdir(path)\n    assert os.path.exists(path)\n    assert os.path.isdir(path)\n\n\ndef test_safe_mkdir_with_existing(base):\n    # an existing path should not raise an exception\n    mkdir(base)\n\n\ndef test_create_with_random_directory():\n    Castra(template=A)\n\n\ndef test_create_with_non_existing_path(base):\n    path = os.path.join(base, 'db')\n    Castra(path=path, template=A)\n\n\ndef test_create_with_existing_path(base):\n    Castra(path=base, template=A)\n\n\ndef test_get_empty(base):\n    df = Castra(path=base, template=A)[:]\n    assert (df.columns == A.columns).all()\n\n\ndef test_get_empty_result(base):\n    c = Castra(path=base, template=A)\n    c.extend(A)\n\n    df = c[100:200]\n\n    assert (df.columns == A.columns).all()\n\n\ndef test_get_slice(base):\n    c = Castra(path=base, template=A)\n    c.extend(A)\n\n    tm.assert_frame_equal(c[:], c[:, :])\n    tm.assert_frame_equal(c[:, 1:], c[:][['y']])\n\n\ndef test_exception_with_non_dir(base):\n    file_ = os.path.join(base, 
'file')\n    with open(file_, 'w') as f:\n        f.write('file')\n    with pytest.raises(ValueError):\n        Castra(file_)\n\n\ndef test_exception_with_existing_castra_and_template(base):\n    with Castra(path=base, template=A) as c:\n        c.extend(A)\n    with pytest.raises(ValueError):\n        Castra(path=base, template=A)\n\n\ndef test_exception_with_empty_dir_and_no_template(base):\n    with pytest.raises(ValueError):\n        Castra(path=base)\n\n\ndef test_load(base):\n    with Castra(path=base, template=A) as c:\n        c.extend(A)\n        c.extend(B)\n\n    loaded = Castra(path=base)\n    tm.assert_frame_equal(pd.concat([A, B]), loaded[:])\n\n\ndef test_del_with_random_dir():\n    c = Castra(template=A)\n    assert os.path.exists(c.path)\n    c.__del__()\n    assert not os.path.exists(c.path)\n\n\ndef test_context_manager_with_random_dir():\n    with Castra(template=A) as c:\n        assert os.path.exists(c.path)\n    assert not os.path.exists(c.path)\n\n\ndef test_context_manager_with_specific_dir(base):\n    with Castra(path=base, template=A) as c:\n        assert os.path.exists(c.path)\n    assert os.path.exists(c.path)\n\n\ndef test_timeseries():\n    indices = [pd.DatetimeIndex(start=str(i), end=str(i+1), freq='w')\n               for i in range(2000, 2015)]\n    dfs = [pd.DataFrame({'x': list(range(len(ind)))}, ind).iloc[:-1]\n           for ind in indices]\n\n    with Castra(template=dfs[0]) as c:\n        for df in dfs:\n            c.extend(df)\n        df = c['2010-05': '2013-02']\n        assert len(df) > 100\n\n\ndef test_Castra():\n    c = Castra(template=A)\n    c.extend(A)\n    c.extend(B)\n\n    assert c.columns == ['x', 'y']\n\n    tm.assert_frame_equal(c[0:100], pd.concat([A, B]))\n    tm.assert_frame_equal(c[:5], A)\n    tm.assert_frame_equal(c[5:], B)\n\n    tm.assert_frame_equal(c[2:5], A[1:])\n    tm.assert_frame_equal(c[2:15], pd.concat([A[1:], B[:1]]))\n\n\ndef test_pickle_Castra():\n    path = 
tempfile.mkdtemp(prefix='castra-')\n    c = Castra(path=path, template=A)\n    c.extend(A)\n    c.extend(B)\n\n    dumped = pickle.dumps(c)\n    undumped = pickle.loads(dumped)\n\n    tm.assert_frame_equal(pd.concat([A, B]), undumped[:])\n\n\ndef test_text():\n    df = pd.DataFrame({'name': ['Alice', 'Bob'],\n                       'balance': [100, 200]}, columns=['name', 'balance'])\n    with Castra(template=df) as c:\n        c.extend(df)\n\n        tm.assert_frame_equal(c[:], df)\n\n\ndef test_column_access():\n    with Castra(template=A) as c:\n        c.extend(A)\n        c.extend(B)\n        df = c[:, ['x']]\n\n        tm.assert_frame_equal(df, pd.concat([A[['x']], B[['x']]]))\n\n        df = c[:, 'x']\n        tm.assert_series_equal(df, pd.concat([A.x, B.x]))\n\n\ndef test_reload():\n    path = tempfile.mkdtemp(prefix='castra-')\n    try:\n        c = Castra(template=A, path=path)\n        c.extend(A)\n\n        d = Castra(path=path)\n\n        assert c.columns == d.columns\n        assert (c.partitions == d.partitions).all()\n        assert c.minimum == d.minimum\n    finally:\n        shutil.rmtree(path)\n\n\ndef test_readonly():\n    path = tempfile.mkdtemp(prefix='castra-')\n    try:\n        c = Castra(path=path, template=A)\n        c.extend(A)\n        d = Castra(path=path, readonly=True)\n        with pytest.raises(IOError):\n            d.extend(B)\n        with pytest.raises(IOError):\n            d.extend_sequence([B])\n        with pytest.raises(IOError):\n            d.flush()\n        with pytest.raises(IOError):\n            d.drop()\n        with pytest.raises(IOError):\n            d.save_partitions()\n        with pytest.raises(IOError):\n            d.flush_meta()\n        assert c.columns == d.columns\n        assert (c.partitions == d.partitions).all()\n        assert c.minimum == d.minimum\n    finally:\n        shutil.rmtree(path)\n\n\ndef test_index_dtype_matches_template():\n    with Castra(template=A) as c:\n        assert 
c.partitions.index.dtype == A.index.dtype\n\n\ndef test_to_dask_dataframe():\n    # importorskip returns the imported module, or skips the test if it is missing\n    dd = pytest.importorskip('dask.dataframe')\n\n    with Castra(template=A) as c:\n        c.extend(A)\n        c.extend(B)\n\n        df = c.to_dask()\n        assert isinstance(df, dd.DataFrame)\n        assert list(df.divisions) == [1, 2, 20]\n        tm.assert_frame_equal(df.compute(), c[:])\n\n        df = c.to_dask('x')\n        assert isinstance(df, dd.Series)\n        assert list(df.divisions) == [1, 2, 20]\n        tm.assert_series_equal(df.compute(), c[:, 'x'])\n\n\ndef test_categorize():\n    A = pd.DataFrame({'x': [1, 2, 3], 'y': ['A', None, 'A']},\n                     columns=['x', 'y'], index=[0, 10, 20])\n    B = pd.DataFrame({'x': [4, 5, 6], 'y': ['C', None, 'A']},\n                     columns=['x', 'y'], index=[30, 40, 50])\n\n    with Castra(template=A, categories=['y']) as c:\n        c.extend(A)\n        assert c[:].dtypes['y'] == 'category'\n        assert c[:]['y'].cat.codes.dtype == np.dtype('i1')\n        assert list(c[:, 'y'].cat.categories) == ['A', None]\n\n        c.extend(B)\n        assert list(c[:, 'y'].cat.categories) == ['A', None, 'C']\n\n        assert c.load_partition(c.partitions.iloc[0], 'y').dtype == 'category'\n\n        c.flush()\n\n        d = Castra(path=c.path)\n        tm.assert_frame_equal(c[:], d[:])\n\n\ndef test_save_axis_names():\n    with Castra(template=C) as c:\n        c.extend(C)\n        assert c[:].index.name == 'z'\n        assert c[:].columns.name == 'cols'\n        tm.assert_frame_equal(c[:], C)\n\n\ndef test_same_categories_when_already_categorized():\n    A = pd.DataFrame({'x': [1, 2] * 1000,\n                      'y': [1., 2.] 
* 1000,\n                      'z': np.random.choice(list('abc'), size=2000)},\n                     columns=list('xyz'))\n    A['z'] = A.z.astype('category')\n    with Castra(template=A, categories=['z']) as c:\n        c.extend(A)\n        assert c.categories['z'] == A.z.cat.categories.tolist()\n\n\ndef test_category_dtype():\n    A = pd.DataFrame({'x': [1, 2] * 3,\n                      'y': [1., 2.] * 3,\n                      'z': list('abcabc')},\n                     columns=list('xyz'))\n    with Castra(template=A, categories=['z']) as c:\n        c.extend(A)\n        assert A.dtypes['z'] == 'object'\n\n\ndef test_do_not_create_dirs_if_template_fails():\n    A = pd.DataFrame({'x': [1, 2] * 3,\n                      'y': [1., 2.] * 3,\n                      'z': list('abcabc')},\n                     columns=list('xyz'))\n    with pytest.raises(ValueError):\n        Castra(template=A, path='foo', categories=['w'])\n    assert not os.path.exists('foo')\n\n\ndef test_sort_on_extend():\n    df = pd.DataFrame({'x': [1, 2, 3]}, index=[3, 2, 1])\n    expected = pd.DataFrame({'x': [3, 2, 1]}, index=[1, 2, 3])\n    with Castra(template=df) as c:\n        c.extend(df)\n        tm.assert_frame_equal(c[:], expected)\n\n\ndef test_select_partitions():\n    p = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 10, 20, 30, 40])\n    assert select_partitions(p, slice(3, 25)) == ['b', 'c', 'd']\n    assert select_partitions(p, slice(None, 25)) == ['a', 'b', 'c', 'd']\n    assert select_partitions(p, slice(3, None)) == ['b', 'c', 'd', 'e']\n    assert select_partitions(p, slice(None, None)) == ['a', 'b', 'c', 'd', 'e']\n    assert select_partitions(p, slice(10, 30)) == ['b', 'c', 'd']\n\n\ndef test_first_index_is_timestamp():\n    pytest.importorskip('dask.dataframe')\n\n    df = pd.DataFrame({'x': [1, 2] * 3,\n                       'y': [1., 2.] 
* 3,\n                       'z': list('abcabc')},\n                      columns=list('xyz'),\n                      index=pd.date_range(start='20120101', periods=6))\n    with Castra(template=df) as c:\n        c.extend(df)\n\n        assert isinstance(c.minimum, pd.Timestamp)\n        assert isinstance(c.to_dask().divisions[0], pd.Timestamp)\n\n\ndef test_minimum_dtype():\n    df = tm.makeTimeDataFrame()\n\n    with Castra(template=df) as c:\n        c.extend(df)\n        assert type(c.minimum) == type(c.partitions.index[0])\n\n\ndef test_many_default_indexes():\n    a = pd.DataFrame({'x': [1, 2, 3]})\n    b = pd.DataFrame({'x': [4, 5, 6]})\n    c = pd.DataFrame({'x': [7, 8, 9]})\n\n    e = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9]})\n\n    with Castra(template=a) as C:\n        C.extend(a)\n        C.extend(b)\n        C.extend(c)\n\n        tm.assert_frame_equal(C[:], e)\n\n\ndef test_raise_error_on_mismatched_index():\n    x = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])\n    y = pd.DataFrame({'x': [1, 2, 3]}, index=[4, 5, 6])\n    z = pd.DataFrame({'x': [4, 5, 6]}, index=[5, 6, 7])\n\n    with Castra(template=x) as c:\n        c.extend(x)\n        c.extend(y)\n\n        with pytest.raises(ValueError):\n            c.extend(z)\n\n\ndef test_raise_error_on_equal_index():\n    a = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])\n    b = pd.DataFrame({'x': [4, 5, 6]}, index=[3, 4, 5])\n\n    with Castra(template=a) as c:\n        c.extend(a)\n\n        with pytest.raises(ValueError):\n            c.extend(b)\n\n\ndef test_categories_nan():\n    a = pd.DataFrame({'x': ['A', np.nan]})\n    b = pd.DataFrame({'x': ['B', np.nan]})\n\n    with Castra(template=a, categories=['x']) as c:\n        c.extend(a)\n        c.extend(b)\n        assert len(c.categories['x']) == 3\n\n\ndef test_extend_sequence_freq():\n    df = pd.util.testing.makeTimeDataFrame(1000, 'min')\n    seq = [df.iloc[i:i+100] for i in range(0,1000,100)]\n    with Castra(template=df) as c:\n     
   c.extend_sequence(seq, freq='h')\n        tm.assert_frame_equal(c[:], df)\n        parts = pd.date_range(start=df.index[59], freq='h',\n                              periods=16).insert(17, df.index[-1])\n        tm.assert_index_equal(c.partitions.index, parts)\n\n    with Castra(template=df) as c:\n        c.extend_sequence(seq, freq='d')\n        tm.assert_frame_equal(c[:], df)\n        assert len(c.partitions) == 1\n\n\ndef test_extend_sequence_none():\n    data = {'a': range(5), 'b': range(5)}\n    p1 = pd.DataFrame(data, index=[1, 2, 3, 4, 5])\n    p2 = pd.DataFrame(data, index=[5, 5, 5, 6, 7])\n    p3 = pd.DataFrame(data, index=[7, 9, 10, 11, 12])\n    seq = [p1, p2, p3]\n    df = pd.concat(seq)\n    with Castra(template=df) as c:\n        c.extend_sequence(seq)\n        tm.assert_frame_equal(c[:], df)\n        assert len(c.partitions) == 3\n        assert len(c.load_partition('1--5', ['a', 'b']).index) == 8\n        assert len(c.load_partition('6--7', ['a', 'b']).index) == 3\n        assert len(c.load_partition('9--12', ['a', 'b']).index) == 4\n\n\ndef test_extend_sequence_overlap():\n    df = pd.util.testing.makeTimeDataFrame(20, 'min')\n    p1 = df.iloc[:15]\n    p2 = df.iloc[10:20]\n    seq = [p1,p2]\n    df = pd.concat(seq)\n    with Castra(template=df) as c:\n        c.extend_sequence(seq)\n        tm.assert_frame_equal(c[:], df.sort_index())\n        assert (c.partitions.index == [p.index[-1] for p in seq]).all()\n    # Check with trivial index\n    p1 = pd.DataFrame({'a': range(10), 'b': range(10)})\n    p2 = pd.DataFrame({'a': range(10, 17), 'b': range(10, 17)})\n    seq = [p1,p2]\n    df = pd.DataFrame({'a': range(17), 'b': range(17)})\n    with Castra(template=df) as c:\n        c.extend_sequence(seq)\n        tm.assert_frame_equal(c[:], df)\n        assert (c.partitions.index == [9, 16]).all()\n\n\ndef test_extend_sequence_single_frame():\n    df = pd.util.testing.makeTimeDataFrame(100, 'h')\n    seq = [df]\n    with Castra(template=df) as c:\n  
      c.extend_sequence(seq, freq='d')\n        assert (c.partitions.index == ['2000-01-01 23:00:00', '2000-01-02 23:00:00',\n                 '2000-01-03 23:00:00', '2000-01-04 23:00:00', '2000-01-05 03:00:00']).all()\n    df = pd.DataFrame({'a': range(10), 'b': range(10)})\n    seq = [df]\n    with Castra(template=df) as c:\n        c.extend_sequence(seq)\n        tm.assert_frame_equal(c[:], df)\n\n\ndef test_column_with_period():\n    df = pd.DataFrame({'x': [10, 20],\n                       '.': [10., 20.]},\n                       columns=['x', '.'],\n                       index=[10, 20])\n\n    with Castra(template=df) as c:\n        c.extend(df)\n\n\ndef test_empty():\n    with Castra(template=A) as c:\n        c.extend(pd.DataFrame(columns=A.columns))\n        assert len(c[:]) == 0\n\n\ndef test_index_with_single_value():\n    df = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 1, 2])\n    with Castra(template=df) as c:\n        c.extend(df)\n\n        tm.assert_frame_equal(c[1], df.loc[1])\n\n\ndef test_categorical_index():\n    df = pd.DataFrame({'x': [1, 2, 3]},\n            index=pd.CategoricalIndex(['a', 'a', 'b'], ordered=True, name='foo'))\n\n    with Castra(template=df, categories=True) as c:\n        c.extend(df)\n        result = c[:]\n        tm.assert_frame_equal(c[:], df)\n\n    A = pd.DataFrame({'x': [1, 2, 3]},\n                    index=pd.Index(['a', 'a', 'b'], name='foo'))\n    B = pd.DataFrame({'x': [4, 5, 6]},\n                    index=pd.Index(['c', 'd', 'd'], name='foo'))\n\n    path = tempfile.mkdtemp(prefix='castra-')\n    try:\n        with Castra(path=path, template=A, categories=['foo']) as c:\n            c.extend(A)\n            c.extend(B)\n\n            c2 = Castra(path=path)\n            result = c2[:]\n\n            expected = pd.concat([A, B])\n            expected.index = pd.CategoricalIndex(expected.index,\n                    name=expected.index.name, ordered=True)\n            tm.assert_frame_equal(result, expected)\n\n     
       tm.assert_frame_equal(c['a'], expected.loc['a'])\n    finally:\n        shutil.rmtree(path)\n\n\ndef test_categorical_index_with_dask_dataframe():\n    pytest.importorskip('dask.dataframe')\n    import dask.dataframe as dd\n    import dask\n\n    A = pd.DataFrame({'x': [1, 2, 3, 4]},\n                    index=pd.Index(['a', 'a', 'b', 'b'], name='foo'))\n    B = pd.DataFrame({'x': [4, 5, 6]},\n                    index=pd.Index(['c', 'd', 'd'], name='foo'))\n\n\n    path = tempfile.mkdtemp(prefix='castra-')\n    try:\n        with Castra(path=path, template=A, categories=['foo']) as c:\n            c.extend(A)\n            c.extend(B)\n\n            df = dd.from_castra(path)\n            assert df.divisions == ('a', 'c', 'd')\n\n            result = df.compute(get=dask.async.get_sync)\n\n            expected = pd.concat([A, B])\n            expected.index = pd.CategoricalIndex(expected.index,\n                    name=expected.index.name, ordered=True)\n\n            tm.assert_frame_equal(result, expected)\n\n            tm.assert_frame_equal(df.loc['a'].compute(), expected.loc['a'])\n            tm.assert_frame_equal(df.loc['b'].compute(get=dask.async.get_sync),\n                                  expected.loc['b'])\n    finally:\n        shutil.rmtree(path)\n\n\ndef test__decategorize():\n    df = pd.DataFrame({'x': [1, 2, 3]},\n                      index=pd.CategoricalIndex(['a', 'a', 'b'], ordered=True,\n                          name='foo'))\n\n    extra, categories, df2 = _decategorize({'.index': []}, df)\n\n    assert (df2.index == [0, 0, 1]).all()\n\n    df3 = _categorize(categories, df2)\n\n    tm.assert_frame_equal(df, df3)\n"
  },
  {
    "path": "requirements.txt",
    "content": "numpy\npandas\nbloscpack>=0.8.0\nblosc\n"
  },
  {
    "path": "setup.py",
    "content": "#!/usr/bin/env python\n\nfrom os.path import exists\nfrom setuptools import setup\n\nsetup(name='castra',\n      version='0.1.8',\n      description='On-disk partitioned store',\n      url='http://github.com/blaze/Castra/',\n      maintainer='Matthew Rocklin',\n      maintainer_email='mrocklin@gmail.com',\n      license='BSD',\n      keywords='',\n      packages=['castra'],\n      package_data={'castra': ['tests/*.py']},\n      install_requires=list(open('requirements.txt').read().strip().split('\\n')),\n      long_description=(open('README.rst').read() if exists('README.rst')\n                        else ''),\n      zip_safe=False)\n"
  }
]