Repository: blaze/castra Branch: master Commit: 1ae53dfcafdd Files: 10 Total size: 44.4 KB Directory structure: gitextract_9pxf5njk/ ├── .gitignore ├── .travis.yml ├── LICENSE ├── MANIFEST.in ├── README.rst ├── castra/ │ ├── __init__.py │ ├── core.py │ └── tests/ │ └── test_core.py ├── requirements.txt └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ *.pyc ================================================ FILE: .travis.yml ================================================ sudo: False language: python matrix: include: - python: 2.7 - python: 3.3 - python: 3.4 - python: 3.5 install: # Install conda - wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh - bash miniconda.sh -b -p $HOME/miniconda - export PATH="$HOME/miniconda/bin:$PATH" - conda config --set always_yes yes --set changeps1 no - conda update conda # Install dependencies - conda create -n castra python=$TRAVIS_PYTHON_VERSION pytest numpy pandas dask - source activate castra - pip install blosc - pip install bloscpack - pip install dask --upgrade script: - py.test -x --doctest-modules --pyargs castra notifications: email: false ================================================ FILE: LICENSE ================================================ Copyright (c) 2015, Continuum Analytics, Inc. Copyright (c) 2015, Valentin Haenel All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of Continuum Analytics nor the names of any contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ================================================ FILE: MANIFEST.in ================================================ recursive-include castra *.py recursive-include docs *.rst include setup.py include README.rst include LICENSE.txt include requirements.txt include MANIFEST.in prune docs/_build ================================================ FILE: README.rst ================================================ Castra ====== |Build Status| Castra is an on-disk, partitioned, compressed, column store. Castra provides efficient columnar range queries. * **Efficient on-disk:** Castra stores data on your hard drive in a way that you can load it quickly, increasing the comfort of inconveniently large data. 
* **Partitioned:** Castra partitions your data along an index, allowing rapid loads of ranges of data like "All records between January and March" * **Compressed:** Castra uses Blosc_ to compress data, increasing effective disk bandwidth and decreasing storage costs * **Column-store:** Castra stores columns separately, drastically reducing I/O costs for analytic queries * **Tabular data:** Castra plays well with Pandas and is an ideal fit for append-only applications like time-series Maintenance ----------- This project is no longer actively maintained. Use at your own risk. Example ------- Consider some Pandas DataFrames .. code-block:: python In [1]: import pandas as pd In [2]: A = pd.DataFrame({'price': [10.0, 11.0], 'volume': [100, 200]}, ...: index=pd.DatetimeIndex(['2010', '2011'])) In [3]: B = pd.DataFrame({'price': [12.0, 13.0], 'volume': [300, 400]}, ...: index=pd.DatetimeIndex(['2012', '2013'])) We create a Castra with a filename and a template dataframe from which to get column name, index, and dtype information .. code-block:: python In [4]: from castra import Castra In [5]: c = Castra('data.castra', template=A) The castra starts empty but we can extend it with new dataframes: .. code-block:: python In [6]: c.extend(A) In [7]: c[:] Out[7]: price volume 2010-01-01 10 100 2011-01-01 11 200 In [8]: c.extend(B) In [9]: c[:] Out[9]: price volume 2010-01-01 10 100 2011-01-01 11 200 2012-01-01 12 300 2013-01-01 13 400 We can select particular columns .. code-block:: python In [10]: c[:, 'price'] Out[10]: 2010-01-01 10 2011-01-01 11 2012-01-01 12 2013-01-01 13 Name: price, dtype: float64 Particular ranges .. code-block:: python In [12]: c['2011':'2013'] Out[12]: price volume 2011-01-01 11 200 2012-01-01 12 300 2013-01-01 13 400 Or both .. 
code-block:: python In [13]: c['2011':'2013', 'volume'] Out[13]: 2011-01-01 200 2012-01-01 300 2013-01-01 400 Name: volume, dtype: int64 Storage ------- Castra stores your dataframes as they arrived; you can see the divisions along which your data is divided. .. code-block:: python In [14]: c.partitions Out[14]: 2011-01-01 2009-12-31T16:00:00.000000000-0800--2010-12-31... 2013-01-01 2011-12-31T16:00:00.000000000-0800--2012-12-31... dtype: object Each column in each partition lives in a separate compressed file:: $ ls -a data.castra/2011-12-31T16:00:00.000000000-0800--2012-12-31T16:00:00.000000000-0800 . .. .index price volume Restrictions ------------ Castra is both fast and restrictive. * You must always give it dataframes that match its template (same column names, index type, dtypes). * You can only give castra dataframes with **increasing index values**. For example, you can give it one dataframe a day for values on that day. You cannot go back and update previous days. Text and Categoricals --------------------- Castra tries to encode text and object dtype columns with msgpack_, using the implementation found in the Pandas library. It falls back to `pickle` with a high protocol if that fails. Alternatively, Castra can categorize your data as it receives it .. code-block:: python >>> c = Castra('data.castra', template=df, categories=['list', 'of', 'columns']) or >>> c = Castra('data.castra', template=df, categories=True) # all object dtype columns Categorizing columns that have repetitive text, like ``'sex'`` or ``'ticker-symbol'``, can greatly improve both read times and computational performance with Pandas. See this blogpost_ for more information. .. _msgpack: http://msgpack.org/index.html Dask dataframe -------------- Castra interoperates smoothly with dask.dataframe_ .. 
code-block:: python >>> import dask.dataframe as dd >>> df = dd.read_csv('myfiles.*.csv') >>> df.set_index('timestamp', compute=False).to_castra('myfile.castra', categories=True) >>> df = dd.from_castra('myfile.castra') Work in Progress ---------------- Castra is immature and largely for experimental use. The developers do not promise backwards compatibility with future versions. You should treat castra as a very efficient temporary format and archive your data with some other system. .. _Blosc: https://github.com/Blosc .. _dask.dataframe: https://dask.pydata.org/en/latest/dataframe.html .. _blogpost: http://matthewrocklin.com/blog/work/2015/06/18/Categoricals .. |Build Status| image:: https://travis-ci.org/blaze/castra.svg :target: https://travis-ci.org/blaze/castra ================================================ FILE: castra/__init__.py ================================================ from .core import Castra __version__ = '0.1.8' ================================================ FILE: castra/core.py ================================================ from collections import Iterator import os from os.path import exists, isdir try: import cPickle as pickle except ImportError: import pickle import shutil import tempfile from hashlib import md5 from functools import partial import blosc import bloscpack import numpy as np import pandas as pd from pandas import msgpack bp_args = bloscpack.BloscpackArgs(offsets=False, checksum='None') def blosc_args(dt): if np.issubdtype(dt, int): return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True) if np.issubdtype(dt, np.datetime64): return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True) if np.issubdtype(dt, float): return bloscpack.BloscArgs(dt.itemsize, clevel=1, shuffle=False) return None # http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python import string valid_chars = "-_%s%s" % (string.ascii_letters, string.digits) def escape(text): """ >>> escape("Hello!") # Remove 
punctuation from names 'Hello' >>> escape("/!.") # completely invalid names produce hash string 'cb6698330c63e87fc35933a0474238b0' """ result = ''.join(c for c in str(text) if c in valid_chars) if not result: result = md5(str(text).encode()).hexdigest() return result def mkdir(path): if not exists(path): os.makedirs(path) class Castra(object): meta_fields = ['columns', 'dtypes', 'index_dtype', 'axis_names'] def __init__(self, path=None, template=None, categories=None, readonly=False): self._readonly = readonly # check if we should create a random path self._explicitly_given_path = path is not None if not self._explicitly_given_path: self.path = tempfile.mkdtemp(prefix='castra-') else: self.path = path # either we have a meta directory if isdir(self.dirname('meta')): if template is not None: raise ValueError( "Opening a castra with a template, yet this castra\n" "already exists. Filename: %s" % self.path) self.load_meta() self.load_partitions() self.load_categories() # or we don't, in which case we need a template elif template is not None: if self._readonly: raise ValueError("Can't create new castra in readonly mode") if isinstance(categories, (list, tuple)): if template.index.name in categories: categories.remove(template.index.name) categories.append('.index') self.categories = dict((col, []) for col in categories) elif categories is True: self.categories = dict((col, []) for col in template.columns if template.dtypes[col] == 'object') if isinstance(template.index, pd.CategoricalIndex): self.categories['.index'] = [] else: self.categories = dict() if self.categories: categories = set(self.categories) template_categories = set(template.dtypes.index.values) if categories.difference(template_categories) - set(['.index']): raise ValueError('passed in categories %s are not all ' 'contained in template dataframe columns ' '%s' % (categories, template_categories)) template2 = _decategorize(self.categories, template)[2] self.columns, self.dtypes, self.index_dtype = \ 
list(template2.columns), template2.dtypes, template2.index.dtype self.axis_names = [template2.index.name, template2.columns.name] # If index is a RangeIndex, use Int64Index instead ind_type = type(template2.index) try: if isinstance(template2.index, pd.RangeIndex): ind_type = pd.Int64Index except AttributeError: pass self.partitions = pd.Series([], dtype='O', index=ind_type([])) self.minimum = None # check if the given path exists already and create it if it doesn't mkdir(self.path) # raise an Exception if it isn't a directory if not isdir(self.path): raise ValueError("'path': %s must be a directory" % self.path) mkdir(self.dirname('meta', 'categories')) self.flush_meta() self.save_partitions() else: raise ValueError( "must specify a 'template' when creating a new Castra") def _empty_dataframe(self): data = dict((n, pd.Series([], dtype=d, name=n)) for (n, d) in self.dtypes.iteritems()) index = pd.Index([], name=self.axis_names[0]) columns = pd.Index(self.columns, name=self.axis_names[1]) df = pd.DataFrame(data, columns=columns, index=index) return _categorize(self.categories, df) def load_meta(self, loads=pickle.loads): for name in self.meta_fields: with open(self.dirname('meta', name), 'rb') as f: setattr(self, name, loads(f.read())) def flush_meta(self, dumps=partial(pickle.dumps, protocol=2)): if self._readonly: raise IOError('File not open for writing') for name in self.meta_fields: with open(self.dirname('meta', name), 'wb') as f: f.write(dumps(getattr(self, name))) def load_partitions(self, loads=pickle.loads): with open(self.dirname('meta', 'plist'), 'rb') as f: self.partitions = loads(f.read()) with open(self.dirname('meta', 'minimum'), 'rb') as f: self.minimum = loads(f.read()) def save_partitions(self, dumps=partial(pickle.dumps, protocol=2)): if self._readonly: raise IOError('File not open for writing') with open(self.dirname('meta', 'minimum'), 'wb') as f: f.write(dumps(self.minimum)) with open(self.dirname('meta', 'plist'), 'wb') as f: 
f.write(dumps(self.partitions)) def append_categories(self, new, dumps=partial(pickle.dumps, protocol=2)): if self._readonly: raise IOError('File not open for writing') separator = b'-sep-' for col, cat in new.items(): if cat: with open(self.dirname('meta', 'categories', col), 'ab') as f: f.write(separator.join(map(dumps, cat))) f.write(separator) def load_categories(self, loads=pickle.loads): separator = b'-sep-' self.categories = dict() for col in list(self.columns) + ['.index']: fn = self.dirname('meta', 'categories', col) if os.path.exists(fn): with open(fn, 'rb') as f: text = f.read() self.categories[col] = [loads(x) for x in text.split(separator)[:-1]] def extend(self, df): if self._readonly: raise IOError('File not open for writing') if len(df) == 0: return # TODO: Ensure that df is consistent with existing data if not df.index.is_monotonic_increasing: df = df.sort_index(inplace=False) new_categories, self.categories, df = _decategorize(self.categories, df) self.append_categories(new_categories) if len(self.partitions) and df.index[0] <= self.partitions.index[-1]: if is_trivial_index(df.index): df = df.copy() start = self.partitions.index[-1] + 1 new_index = pd.Index(np.arange(start, start + len(df)), name = df.index.name) df.index = new_index else: raise ValueError("Index of new dataframe less than known data") index = df.index.values partition_name = '--'.join([escape(index.min()), escape(index.max())]) mkdir(self.dirname(partition_name)) # Store columns for col in df.columns: pack_file(df[col].values, self.dirname(partition_name, col)) # Store index fn = self.dirname(partition_name, '.index') bloscpack.pack_ndarray_file(index, fn, bloscpack_args=bp_args, blosc_args=blosc_args(index.dtype)) if not len(self.partitions): self.minimum = coerce_index(index.dtype, index.min()) self.partitions.loc[index.max()] = partition_name self.flush() def extend_sequence(self, seq, freq=None): """Add dataframes from an iterable, optionally repartitioning by freq. 
Parameters ---------- seq : iterable An iterable of dataframes freq : frequency, optional A pandas datetime offset. If provided, the dataframes will be partitioned by this frequency. """ if self._readonly: raise IOError('File not open for writing') if isinstance(freq, str): freq = pd.datetools.to_offset(freq) partitioner = lambda buf, df: partitionby_freq(freq, buf, df) elif freq is None: partitioner = partitionby_none else: raise ValueError("Invalid 'freq': {0}".format(repr(freq))) buf = self._empty_dataframe() for df in seq: write, buf = partitioner(buf, df) for frame in write: self.extend(frame) if buf is not None and not buf.empty: self.extend(buf) def dirname(self, *args): return os.path.join(self.path, *list(map(escape, args))) def load_partition(self, name, columns, categorize=True): if isinstance(columns, Iterator): columns = list(columns) if '.index' in self.categories and name in self.partitions.index: name = self.categories['.index'].index(name) - 1 if not isinstance(columns, list): df = self.load_partition(name, [columns], categorize=categorize) return df.iloc[:, 0] arrays = [unpack_file(self.dirname(name, col)) for col in columns] df = pd.DataFrame(dict(zip(columns, arrays)), columns=pd.Index(columns, name=self.axis_names[1], tupleize_cols=False), index=self.load_index(name)) if categorize: df = _categorize(self.categories, df) return df def load_index(self, name): return pd.Index(unpack_file(self.dirname(name, '.index')), dtype=self.index_dtype, name=self.axis_names[0], tupleize_cols=False) def __getitem__(self, key): if isinstance(key, tuple): key, columns = key else: columns = self.columns if isinstance(columns, slice): columns = self.columns[columns] if isinstance(key, slice): start, stop = key.start, key.stop else: start, stop = key, key if '.index' in self.categories: if start is not None: start = self.categories['.index'].index(start) if stop is not None: stop = self.categories['.index'].index(stop) key = slice(start, stop) names = 
select_partitions(self.partitions, key) if not names: return self._empty_dataframe()[columns] data_frames = [self.load_partition(name, columns, categorize=False) for name in names] data_frames[0] = data_frames[0].loc[start:] data_frames[-1] = data_frames[-1].loc[:stop] df = pd.concat(data_frames) df = _categorize(self.categories, df) return df def drop(self): if self._readonly: raise IOError('File not open for writing') if os.path.exists(self.path): shutil.rmtree(self.path) def flush(self): if self._readonly: raise IOError('File not open for writing') self.save_partitions() def __enter__(self): return self def __exit__(self, *args): if not self._explicitly_given_path: self.drop() elif not self._readonly: self.flush() __del__ = __exit__ def __getstate__(self): if not self._readonly: self.flush() return (self.path, self._explicitly_given_path, self._readonly) def __setstate__(self, state): self.path = state[0] self._explicitly_given_path = state[1] self._readonly = state[2] self.load_meta() self.load_partitions() self.load_categories() def to_dask(self, columns=None): import dask.dataframe as dd meta = self._empty_dataframe() if columns is None: columns = self.columns else: meta = meta[columns] token = md5(str((self.path, os.path.getmtime(self.path))).encode()).hexdigest() name = 'from-castra-' + token divisions = [self.minimum] + self.partitions.index.tolist() if '.index' in self.categories: divisions = ([self.categories['.index'][0]] + [self.categories['.index'][d + 1] for d in divisions[1:-1]] + [self.categories['.index'][-1]]) key_parts = list(enumerate(self.partitions.values)) dsk = dict(((name, i), (Castra.load_partition, self, part, columns)) for i, part in key_parts) if isinstance(columns, list): return dd.DataFrame(dsk, name, meta, divisions) else: return dd.Series(dsk, name, meta, divisions) def pack_file(x, fn, encoding='utf8'): """ Pack numpy array into filename Supports binary data with bloscpack and text data with msgpack+blosc >>> 
pack_file(np.array([1, 2, 3]), 'foo.blp') # doctest: +SKIP See also: unpack_file """ if x.dtype != 'O': bloscpack.pack_ndarray_file(x, fn, bloscpack_args=bp_args, blosc_args=blosc_args(x.dtype)) else: bytes = blosc.compress(msgpack.packb(x.tolist(), encoding=encoding), 1) with open(fn, 'wb') as f: f.write(bytes) def unpack_file(fn, encoding='utf8'): """ Unpack numpy array from filename Supports binary data with bloscpack and text data with msgpack+blosc >>> unpack_file('foo.blp') # doctest: +SKIP array([1, 2, 3]) See also: pack_file """ try: return bloscpack.unpack_ndarray_file(fn) except ValueError: with open(fn, 'rb') as f: data = msgpack.unpackb(blosc.decompress(f.read()), encoding=encoding) return np.array(data, object, copy=False) def coerce_index(dt, o): if np.issubdtype(dt, np.datetime64): return pd.Timestamp(o) return o def select_partitions(partitions, key): """ Select partitions from partition list given slice >>> p = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 10, 20, 30, 40]) >>> select_partitions(p, slice(3, 25)) ['b', 'c', 'd'] """ assert key.step is None, 'step must be None but was %s' % key.step start, stop = key.start, key.stop if start is not None: start = coerce_index(partitions.index.dtype, start) istart = partitions.index.searchsorted(start) else: istart = 0 if stop is not None: stop = coerce_index(partitions.index.dtype, stop) istop = partitions.index.searchsorted(stop) else: istop = len(partitions) - 1 names = partitions.iloc[istart: istop + 1].values.tolist() return names def _decategorize(categories, df): """ Strip object dtypes from dataframe, update categories Given a DataFrame >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['C', 'B', 'B']}) And a dict of known categories >>> _ = categories = {'y': ['A', 'B']} Update dict and dataframe in place >>> extra, categories, df = _decategorize(categories, df) >>> extra {'y': ['C']} >>> categories {'y': ['A', 'B', 'C']} >>> df x y 0 1 2 1 2 1 2 3 1 """ extra = dict() new_categories = dict() 
new_columns = dict((col, df[col].values) for col in df.columns) for col, cat in categories.items(): if col == '.index' or col not in df.columns: continue idx = pd.Index(df[col]) idx = getattr(idx, 'categories', idx) ex = idx[~idx.isin(cat)].unique() if any(pd.isnull(c) for c in cat): ex = ex[~pd.isnull(ex)] extra[col] = ex.tolist() new_categories[col] = cat + extra[col] new_columns[col] = pd.Categorical(df[col].values, new_categories[col]).codes if '.index' in categories: idx = df.index idx = getattr(idx, 'categories', idx) ex = idx[~idx.isin(cat)].unique() if any(pd.isnull(c) for c in cat): ex = ex[~pd.isnull(ex)] extra['.index'] = ex.tolist() new_categories['.index'] = cat + extra['.index'] new_index = pd.Categorical(df.index, new_categories['.index']).codes new_index = pd.Index(new_index, name=df.index.name) else: new_index = df.index new_df = pd.DataFrame(new_columns, columns=df.columns, index=new_index) return extra, new_categories, new_df def make_categorical(s, categories): name = '.index' if isinstance(s, pd.Index) else s.name if name in categories: idx = pd.Index(categories[name], tupleize_cols=False, dtype='object') idx.is_unique = True cat = pd.Categorical(s.values, categories=idx, fastpath=True, ordered=False) return pd.CategoricalIndex(cat, name=s.name, ordered=True) if name == '.index' else cat return s if name == '.index' else s.values def _categorize(categories, df): """ Categorize columns in dataframe >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 2, 0]}) >>> categories = {'y': ['A', 'B', 'c']} >>> _categorize(categories, df) x y 0 1 A 1 2 c 2 3 A """ if isinstance(df, pd.Series): return pd.Series(make_categorical(df, categories), index=make_categorical(df.index, categories), name=df.name) else: return pd.DataFrame(dict((col, make_categorical(df[col], categories)) for col in df.columns), columns=df.columns, index=make_categorical(df.index, categories)) def partitionby_none(buf, new): """Repartition to ensure partitions don't split duplicate 
indices""" if new.empty: return [], buf elif buf.empty: return [], new if not new.index.is_monotonic_increasing: new = new.sort_index(inplace=False) end = buf.index[-1] if end >= new.index[0] and not is_trivial_index(new.index): i = new.index.searchsorted(end, side='right') # Only need to concat, `castra.extend` will resort if needed buf = pd.concat([buf, new.iloc[:i]]) new = new.iloc[i:] return [buf], new def partitionby_freq(freq, buf, new): """Partition frames into blocks by a freq""" df = pd.concat([buf, new]) if not df.index.is_monotonic_increasing: df = df.sort_index(inplace=False) start, end = pd.tseries.resample._get_range_edges(df.index[0], df.index[-1], freq) inds = [df.index.searchsorted(i) for i in pd.date_range(start, end, freq=freq)[1:]] slices = [(inds[i-1], inds[i]) if i else (0, inds[i]) for i in range(len(inds))] frames = [df.iloc[i:j] for (i, j) in slices] return frames[:-1], frames[-1] def is_trivial_index(ind): """ Is this index just 0..n ? If so then we can probably ignore or change it around as necessary >>> is_trivial_index(pd.Index([0, 1, 2])) True >>> is_trivial_index(pd.Index([0, 3, 5])) False """ return ind[0] == 0 and (ind == np.arange(len(ind))).all() ================================================ FILE: castra/tests/test_core.py ================================================ import os import tempfile import pickle import shutil import pandas as pd import pandas.util.testing as tm import pytest import numpy as np from castra import Castra from castra.core import mkdir, select_partitions, _decategorize, _categorize A = pd.DataFrame({'x': [1, 2], 'y': [1., 2.]}, columns=['x', 'y'], index=[1, 2]) B = pd.DataFrame({'x': [10, 20], 'y': [10., 20.]}, columns=['x', 'y'], index=[10, 20]) C = pd.DataFrame({'x': [10, 20], 'y': [10., 20.], 'z': [0, 1]}, columns=['x', 'y', 'z']).set_index('z') C.columns.name = 'cols' @pytest.yield_fixture def base(): d = tempfile.mkdtemp(prefix='castra-') try: yield d finally: shutil.rmtree(d) def 
test_safe_mkdir_with_new(base): path = os.path.join(base, 'db') mkdir(path) assert os.path.exists(path) assert os.path.isdir(path) def test_safe_mkdir_with_existing(base): # an existing path should not raise an exception mkdir(base) def test_create_with_random_directory(): Castra(template=A) def test_create_with_non_existing_path(base): path = os.path.join(base, 'db') Castra(path=path, template=A) def test_create_with_existing_path(base): Castra(path=base, template=A) def test_get_empty(base): df = Castra(path=base, template=A)[:] assert (df.columns == A.columns).all() def test_get_empty_result(base): c = Castra(path=base, template=A) c.extend(A) df = c[100:200] assert (df.columns == A.columns).all() def test_get_slice(base): c = Castra(path=base, template=A) c.extend(A) tm.assert_frame_equal(c[:], c[:, :]) tm.assert_frame_equal(c[:, 1:], c[:][['y']]) def test_exception_with_non_dir(base): file_ = os.path.join(base, 'file') with open(file_, 'w') as f: f.write('file') with pytest.raises(ValueError): Castra(file_) def test_exception_with_existing_castra_and_template(base): with Castra(path=base, template=A) as c: c.extend(A) with pytest.raises(ValueError): Castra(path=base, template=A) def test_exception_with_empty_dir_and_no_template(base): with pytest.raises(ValueError): Castra(path=base) def test_load(base): with Castra(path=base, template=A) as c: c.extend(A) c.extend(B) loaded = Castra(path=base) tm.assert_frame_equal(pd.concat([A, B]), loaded[:]) def test_del_with_random_dir(): c = Castra(template=A) assert os.path.exists(c.path) c.__del__() assert not os.path.exists(c.path) def test_context_manager_with_random_dir(): with Castra(template=A) as c: assert os.path.exists(c.path) assert not os.path.exists(c.path) def test_context_manager_with_specific_dir(base): with Castra(path=base, template=A) as c: assert os.path.exists(c.path) assert os.path.exists(c.path) def test_timeseries(): indices = [pd.DatetimeIndex(start=str(i), end=str(i+1), freq='w') for i in 
range(2000, 2015)] dfs = [pd.DataFrame({'x': list(range(len(ind)))}, ind).iloc[:-1] for ind in indices] with Castra(template=dfs[0]) as c: for df in dfs: c.extend(df) df = c['2010-05': '2013-02'] assert len(df) > 100 def test_Castra(): c = Castra(template=A) c.extend(A) c.extend(B) assert c.columns == ['x', 'y'] tm.assert_frame_equal(c[0:100], pd.concat([A, B])) tm.assert_frame_equal(c[:5], A) tm.assert_frame_equal(c[5:], B) tm.assert_frame_equal(c[2:5], A[1:]) tm.assert_frame_equal(c[2:15], pd.concat([A[1:], B[:1]])) def test_pickle_Castra(): path = tempfile.mkdtemp(prefix='castra-') c = Castra(path=path, template=A) c.extend(A) c.extend(B) dumped = pickle.dumps(c) undumped = pickle.loads(dumped) tm.assert_frame_equal(pd.concat([A, B]), undumped[:]) def test_text(): df = pd.DataFrame({'name': ['Alice', 'Bob'], 'balance': [100, 200]}, columns=['name', 'balance']) with Castra(template=df) as c: c.extend(df) tm.assert_frame_equal(c[:], df) def test_column_access(): with Castra(template=A) as c: c.extend(A) c.extend(B) df = c[:, ['x']] tm.assert_frame_equal(df, pd.concat([A[['x']], B[['x']]])) df = c[:, 'x'] tm.assert_series_equal(df, pd.concat([A.x, B.x])) def test_reload(): path = tempfile.mkdtemp(prefix='castra-') try: c = Castra(template=A, path=path) c.extend(A) d = Castra(path=path) assert c.columns == d.columns assert (c.partitions == d.partitions).all() assert c.minimum == d.minimum finally: shutil.rmtree(path) def test_readonly(): path = tempfile.mkdtemp(prefix='castra-') try: c = Castra(path=path, template=A) c.extend(A) d = Castra(path=path, readonly=True) with pytest.raises(IOError): d.extend(B) with pytest.raises(IOError): d.extend_sequence([B]) with pytest.raises(IOError): d.flush() with pytest.raises(IOError): d.drop() with pytest.raises(IOError): d.save_partitions() with pytest.raises(IOError): d.flush_meta() assert c.columns == d.columns assert (c.partitions == d.partitions).all() assert c.minimum == d.minimum finally: shutil.rmtree(path) def 
test_index_dtype_matches_template(): with Castra(template=A) as c: assert c.partitions.index.dtype == A.index.dtype def test_to_dask_dataframe(): pytest.importorskip('dask.dataframe') try: import dask.dataframe as dd except ImportError: return with Castra(template=A) as c: c.extend(A) c.extend(B) df = c.to_dask() assert isinstance(df, dd.DataFrame) assert list(df.divisions) == [1, 2, 20] tm.assert_frame_equal(df.compute(), c[:]) df = c.to_dask('x') assert isinstance(df, dd.Series) assert list(df.divisions) == [1, 2, 20] tm.assert_series_equal(df.compute(), c[:, 'x']) def test_categorize(): A = pd.DataFrame({'x': [1, 2, 3], 'y': ['A', None, 'A']}, columns=['x', 'y'], index=[0, 10, 20]) B = pd.DataFrame({'x': [4, 5, 6], 'y': ['C', None, 'A']}, columns=['x', 'y'], index=[30, 40, 50]) with Castra(template=A, categories=['y']) as c: c.extend(A) assert c[:].dtypes['y'] == 'category' assert c[:]['y'].cat.codes.dtype == np.dtype('i1') assert list(c[:, 'y'].cat.categories) == ['A', None] c.extend(B) assert list(c[:, 'y'].cat.categories) == ['A', None, 'C'] assert c.load_partition(c.partitions.iloc[0], 'y').dtype == 'category' c.flush() d = Castra(path=c.path) tm.assert_frame_equal(c[:], d[:]) def test_save_axis_names(): with Castra(template=C) as c: c.extend(C) assert c[:].index.name == 'z' assert c[:].columns.name == 'cols' tm.assert_frame_equal(c[:], C) def test_same_categories_when_already_categorized(): A = pd.DataFrame({'x': [1, 2] * 1000, 'y': [1., 2.] * 1000, 'z': np.random.choice(list('abc'), size=2000)}, columns=list('xyz')) A['z'] = A.z.astype('category') with Castra(template=A, categories=['z']) as c: c.extend(A) assert c.categories['z'] == A.z.cat.categories.tolist() def test_category_dtype(): A = pd.DataFrame({'x': [1, 2] * 3, 'y': [1., 2.] 
* 3, 'z': list('abcabc')}, columns=list('xyz')) with Castra(template=A, categories=['z']) as c: c.extend(A) assert A.dtypes['z'] == 'object' def test_do_not_create_dirs_if_template_fails(): A = pd.DataFrame({'x': [1, 2] * 3, 'y': [1., 2.] * 3, 'z': list('abcabc')}, columns=list('xyz')) with pytest.raises(ValueError): Castra(template=A, path='foo', categories=['w']) assert not os.path.exists('foo') def test_sort_on_extend(): df = pd.DataFrame({'x': [1, 2, 3]}, index=[3, 2, 1]) expected = pd.DataFrame({'x': [3, 2, 1]}, index=[1, 2, 3]) with Castra(template=df) as c: c.extend(df) tm.assert_frame_equal(c[:], expected) def test_select_partitions(): p = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 10, 20, 30, 40]) assert select_partitions(p, slice(3, 25)) == ['b', 'c', 'd'] assert select_partitions(p, slice(None, 25)) == ['a', 'b', 'c', 'd'] assert select_partitions(p, slice(3, None)) == ['b', 'c', 'd', 'e'] assert select_partitions(p, slice(None, None)) == ['a', 'b', 'c', 'd', 'e'] assert select_partitions(p, slice(10, 30)) == ['b', 'c', 'd'] def test_first_index_is_timestamp(): pytest.importorskip('dask.dataframe') df = pd.DataFrame({'x': [1, 2] * 3, 'y': [1., 2.] 
                       * 3, 'z': list('abcabc')}, columns=list('xyz'),
                      index=pd.date_range(start='20120101', periods=6))
    with Castra(template=df) as c:
        c.extend(df)

        assert isinstance(c.minimum, pd.Timestamp)
        assert isinstance(c.to_dask().divisions[0], pd.Timestamp)


def test_minimum_dtype():
    df = tm.makeTimeDataFrame()

    with Castra(template=df) as c:
        c.extend(df)
        assert type(c.minimum) == type(c.partitions.index[0])


def test_many_default_indexes():
    a = pd.DataFrame({'x': [1, 2, 3]})
    b = pd.DataFrame({'x': [4, 5, 6]})
    c = pd.DataFrame({'x': [7, 8, 9]})
    e = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

    with Castra(template=a) as C:
        C.extend(a)
        C.extend(b)
        C.extend(c)

        tm.assert_frame_equal(C[:], e)


def test_raise_error_on_mismatched_index():
    x = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])
    y = pd.DataFrame({'x': [1, 2, 3]}, index=[4, 5, 6])
    z = pd.DataFrame({'x': [4, 5, 6]}, index=[5, 6, 7])

    with Castra(template=x) as c:
        c.extend(x)
        c.extend(y)

        with pytest.raises(ValueError):
            c.extend(z)


def test_raise_error_on_equal_index():
    a = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])
    b = pd.DataFrame({'x': [4, 5, 6]}, index=[3, 4, 5])

    with Castra(template=a) as c:
        c.extend(a)

        with pytest.raises(ValueError):
            c.extend(b)


def test_categories_nan():
    a = pd.DataFrame({'x': ['A', np.nan]})
    b = pd.DataFrame({'x': ['B', np.nan]})

    with Castra(template=a, categories=['x']) as c:
        c.extend(a)
        c.extend(b)
        assert len(c.categories['x']) == 3


def test_extend_sequence_freq():
    df = pd.util.testing.makeTimeDataFrame(1000, 'min')
    seq = [df.iloc[i:i+100] for i in range(0,1000,100)]
    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='h')
        tm.assert_frame_equal(c[:], df)
        parts = pd.date_range(start=df.index[59], freq='h',
                              periods=16).insert(17, df.index[-1])
        tm.assert_index_equal(c.partitions.index, parts)

    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='d')
        tm.assert_frame_equal(c[:], df)
        assert len(c.partitions) == 1


def test_extend_sequence_none():
    data = {'a': range(5), 'b': range(5)}
    p1 = \
        pd.DataFrame(data, index=[1, 2, 3, 4, 5])
    p2 = pd.DataFrame(data, index=[5, 5, 5, 6, 7])
    p3 = pd.DataFrame(data, index=[7, 9, 10, 11, 12])
    seq = [p1, p2, p3]
    df = pd.concat(seq)
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)
        assert len(c.partitions) == 3
        assert len(c.load_partition('1--5', ['a', 'b']).index) == 8
        assert len(c.load_partition('6--7', ['a', 'b']).index) == 3
        assert len(c.load_partition('9--12', ['a', 'b']).index) == 4


def test_extend_sequence_overlap():
    df = pd.util.testing.makeTimeDataFrame(20, 'min')
    p1 = df.iloc[:15]
    p2 = df.iloc[10:20]
    seq = [p1,p2]
    df = pd.concat(seq)
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df.sort_index())
        assert (c.partitions.index == [p.index[-1] for p in seq]).all()
    # Check with trivial index
    p1 = pd.DataFrame({'a': range(10), 'b': range(10)})
    p2 = pd.DataFrame({'a': range(10, 17), 'b': range(10, 17)})
    seq = [p1,p2]
    df = pd.DataFrame({'a': range(17), 'b': range(17)})
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)
        assert (c.partitions.index == [9, 16]).all()


def test_extend_sequence_single_frame():
    df = pd.util.testing.makeTimeDataFrame(100, 'h')
    seq = [df]
    with Castra(template=df) as c:
        c.extend_sequence(seq, freq='d')
        assert (c.partitions.index == ['2000-01-01 23:00:00',
                                       '2000-01-02 23:00:00',
                                       '2000-01-03 23:00:00',
                                       '2000-01-04 23:00:00',
                                       '2000-01-05 03:00:00']).all()
    df = pd.DataFrame({'a': range(10), 'b': range(10)})
    seq = [df]
    with Castra(template=df) as c:
        c.extend_sequence(seq)
        tm.assert_frame_equal(c[:], df)


def test_column_with_period():
    df = pd.DataFrame({'x': [10, 20], '.': [10., 20.]},
                      columns=['x', '.'], index=[10, 20])
    with Castra(template=df) as c:
        c.extend(df)


def test_empty():
    with Castra(template=A) as c:
        c.extend(pd.DataFrame(columns=A.columns))
        assert len(c[:]) == 0


def test_index_with_single_value():
    df = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 1, 2])
    with Castra(template=df) as c:
        c.extend(df)
        tm.assert_frame_equal(c[1], df.loc[1])


def test_categorical_index():
    df = pd.DataFrame({'x': [1, 2, 3]},
                      index=pd.CategoricalIndex(['a', 'a', 'b'], ordered=True,
                                                name='foo'))

    with Castra(template=df, categories=True) as c:
        c.extend(df)
        result = c[:]
        tm.assert_frame_equal(c[:], df)

    A = pd.DataFrame({'x': [1, 2, 3]},
                     index=pd.Index(['a', 'a', 'b'], name='foo'))
    B = pd.DataFrame({'x': [4, 5, 6]},
                     index=pd.Index(['c', 'd', 'd'], name='foo'))

    path = tempfile.mkdtemp(prefix='castra-')
    try:
        with Castra(path=path, template=A, categories=['foo']) as c:
            c.extend(A)
            c.extend(B)

            c2 = Castra(path=path)
            result = c2[:]

            expected = pd.concat([A, B])
            expected.index = pd.CategoricalIndex(expected.index,
                                                 name=expected.index.name,
                                                 ordered=True)
            tm.assert_frame_equal(result, expected)

            tm.assert_frame_equal(c['a'], expected.loc['a'])
    finally:
        shutil.rmtree(path)


def test_categorical_index_with_dask_dataframe():
    pytest.importorskip('dask.dataframe')
    import dask.dataframe as dd
    import dask

    A = pd.DataFrame({'x': [1, 2, 3, 4]},
                     index=pd.Index(['a', 'a', 'b', 'b'], name='foo'))
    B = pd.DataFrame({'x': [4, 5, 6]},
                     index=pd.Index(['c', 'd', 'd'], name='foo'))

    path = tempfile.mkdtemp(prefix='castra-')
    try:
        with Castra(path=path, template=A, categories=['foo']) as c:
            c.extend(A)
            c.extend(B)

            df = dd.from_castra(path)
            assert df.divisions == ('a', 'c', 'd')

            result = df.compute(get=dask.async.get_sync)

            expected = pd.concat([A, B])
            expected.index = pd.CategoricalIndex(expected.index,
                                                 name=expected.index.name,
                                                 ordered=True)
            tm.assert_frame_equal(result, expected)

            tm.assert_frame_equal(df.loc['a'].compute(), expected.loc['a'])
            tm.assert_frame_equal(df.loc['b'].compute(get=dask.async.get_sync),
                                  expected.loc['b'])
    finally:
        shutil.rmtree(path)


def test__decategorize():
    df = pd.DataFrame({'x': [1, 2, 3]},
                      index=pd.CategoricalIndex(['a', 'a', 'b'], ordered=True,
                                                name='foo'))

    extra, categories, df2 = _decategorize({'.index': []}, df)

    assert (df2.index == [0, 0, 1]).all()

    df3 = \
        _categorize(categories, df2)

    tm.assert_frame_equal(df, df3)

================================================
FILE: requirements.txt
================================================
numpy
pandas
bloscpack>=0.8.0
blosc

================================================
FILE: setup.py
================================================
#!/usr/bin/env python

from os.path import exists
from setuptools import setup

setup(name='castra',
      version='0.1.8',
      description='On-disk partitioned store',
      url='http://github.com/blaze/Castra/',
      maintainer='Matthew Rocklin',
      maintainer_email='mrocklin@gmail.com',
      license='BSD',
      keywords='',
      packages=['castra'],
      package_data={'castra': ['tests/*.py']},
      install_requires=list(open('requirements.txt').read().strip().split('\n')),
      long_description=(open('README.rst').read() if exists('README.rst')
                        else ''),
      zip_safe=False)