Repository: robdmc/consecution Branch: develop Commit: c23b4ea20fb7 Files: 29 Total size: 119.8 KB Directory structure: gitextract_eotr679u/ ├── .coveragerc ├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── consecution/ │ ├── .coverage │ ├── __init__.py │ ├── nodes.py │ ├── pipeline.py │ ├── tests/ │ │ ├── __init__.py │ │ ├── nodes_tests.py │ │ ├── pipeline_tests.py │ │ ├── testing_helpers.py │ │ └── utils_tests.py │ └── utils.py ├── docker/ │ ├── Dockerfile │ ├── docker_build.sh │ ├── docker_run.sh │ └── simple_example.py ├── docs/ │ ├── Makefile │ ├── conf.py │ ├── index.rst │ ├── ref/ │ │ └── consecution.rst │ └── toc.rst ├── pandashells.md ├── publish.py ├── sample_data.csv ├── setup.cfg └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .coveragerc ================================================ [report] show_missing = True ================================================ FILE: .gitignore ================================================ .DS_Store *.pyc ================================================ FILE: .travis.yml ================================================ sudo: false language: python python: - '2.7' - '3.4' - '3.5' - '3.6' - '3.7' install: - pip install -e .[dev] before_script: - flake8 . script: - nosetests - coverage report --fail-under=100 after_success: - coveralls notifications: email: false addons: apt: packages: - graphviz ================================================ FILE: LICENSE ================================================ Copyright (c) 2015, Robert deCarvalho All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the FreeBSD Project. ================================================ FILE: README.md ================================================ Update (2/23/2021) === It looks like this README is slowly turning into a reference of all the projects in this space that I think are better than consecution. Here is [metaflow](https://github.com/Netflix/metaflow), an offering from Netflix. Update (9/21/2020) === Another library that I believe to be better than consecution is the [pypeln](https://cgarciae.github.io/pypeln/) project. The way it allows for a different number of workers on each node of a pipeline is quite nice. Additionally the ability to control whether each node is run using threads, processes, async, or sync is really useful. 
Update (5/1/2020)
===
Since writing this, the excellent [streamz](https://streamz.readthedocs.io/en/latest/) package has been created. Streamz is the project I wish had existed back when I wrote this. It is a much more capable implementation of the core ideas of consecution, and plays nicely with [dask](https://dask.org/) to achieve scale. I have started using streamz in my work in place of consecution.

Consecution
===
[![Build Status](https://travis-ci.org/robdmc/consecution.svg?branch=develop)](https://travis-ci.org/robdmc/consecution)
[![Coverage Status](https://coveralls.io/repos/github/robdmc/consecution/badge.svg?branch=develop)](https://coveralls.io/github/robdmc/consecution?branch=add_docs)

Introduction
---
Consecution is:

* An easy-to-use pipeline abstraction inspired by Apache Storm topologies
* Designed to simplify building ETL pipelines that are robust and easy to test
* A system for wiring together simple processing nodes to form a DAG, which is fed with a Python iterable
* Built using synchronous, single-threaded execution strategies designed to run efficiently on a single core
* Implemented in pure Python, with optional requirements that are needed only for graph visualization
* Written with 100% test coverage

Consecution makes it easy to build systems like this.

![Output Image](/images/etl_example.png?raw=true "ETL Example")

Installation
---
Consecution is a pure-Python package that is simply installed with pip. The only optional system requirement is the Graphviz package, which is needed only if you want to create graphical representations of your pipeline.
```bash
[~]$ pip install consecution
```
Docker
---
If you would like to try out consecution on Docker, check out consecution from GitHub and navigate to the `docker/` subdirectory. From there, run the following.

* Build the consecution image: `docker_build.sh`
* Start a container: `docker_run.sh`
* Once in the container, run the example: `python simple_example.py`

Quick Start
---
What follows is a quick tour of consecution. See the API documentation for more detailed information.

### Nodes
Consecution works by wiring together nodes. You create nodes by inheriting from the `consecution.Node` class. Every node must define a `.process()` method. This method contains whatever logic you want for processing single items as they pass through your pipeline. Here is an example of a node that simply logs items passing through it.

```python
from consecution import Node


class LogNode(Node):
    def process(self, item):
        # any logic you want for processing a single item
        print('{: >15} processing {}'.format(self.name, item))
        # send item downstream
        self.push(item)
```

### Pipelines
Now let's create a pipeline that wires together a series of these logging nodes. We do this by employing the pipe symbol in much the same way that you pipe data between programs in unix. Note that you must name nodes when you instantiate them.

```python
from consecution import Node, Pipeline


# This is the same node class we defined above
class LogNode(Node):
    def process(self, item):
        print('{} processing {}'.format(self.name, item))
        self.push(item)


# Connect nodes with pipe symbols to create a pipeline for consuming any iterable.
pipe = Pipeline(
    LogNode('extract') | LogNode('transform') | LogNode('load')
)
```

At this point, we can visualize the pipeline to verify that the topology is what we expect it to be. If you have Graphviz installed, you can now simply type one of the following to see the pipeline visualized.
```python
# Create a pipeline.png file in your working directory
pipe.plot()

# Interactively display the pipeline visualization in an IPython notebook
# by simply making the final expression in a cell evaluate to a pipeline.
pipe
```

The plot command should produce the following visualization.

![Output Image](/images/etl1.png?raw=true "Three Node ETL Example")

If you don't have Graphviz installed, you can print the pipeline object to get a text-based visualization.

```python
print(pipe)
```

This represents your pipeline as a series of pipe statements showing how data is piped between nodes.

```
Pipeline
--------------------------------------------------------------------
  extract | transform
transform | load
--------------------------------------------------------------------
```

We can now process an iterable with our pipeline by running

```python
pipe.consume(range(3))
```

which will print the following to the console.

```
extract processing 0
transform processing 0
load processing 0
extract processing 1
transform processing 1
load processing 1
extract processing 2
transform processing 2
load processing 2
```

### Broadcasting
Piping the output of a single node into a list of nodes will cause the single node to broadcast its pushed items to every node in the list.
So, again, using our logging node, we could construct a pipeline like this:

```python
from consecution import Node, Pipeline


class LogNode(Node):
    def process(self, item):
        print('{} processing {}'.format(self.name, item))
        self.push(item)


# pipe to a list of nodes to broadcast items
pipe = Pipeline(
    LogNode('extract') | LogNode('transform') | [
        LogNode('load_redis'), LogNode('load_postgres'), LogNode('load_mongo')
    ]
)
pipe.plot()
pipe.consume(range(2))
```

The plot command produces this visualization

![Output Image](/images/broadcast.png?raw=true "Broadcast Example")

and consuming `range(2)` produces this output

```
extract processing 0
transform processing 0
load_redis processing 0
load_postgres processing 0
load_mongo processing 0
extract processing 1
transform processing 1
load_redis processing 1
load_postgres processing 1
load_mongo processing 1
```

### Routing
If you pipe to a list that contains multiple nodes and a single callable, then consecution will interpret the callable as a routing function that accepts a single item as its only argument and returns the name of one of the nodes in the list. The routing function will direct the flow of items as illustrated below.
```python
from consecution import Node, Pipeline


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


def parity(item):
    if item % 2 == 0:
        return 'transform_even'
    else:
        return 'transform_odd'


# pipe to a list containing a callable to achieve routing behaviour
pipe = Pipeline(
    LogNode('extract') | [
        LogNode('transform_even'), LogNode('transform_odd'), parity
    ]
)
pipe.plot()
pipe.consume(range(4))
```

The plot command produces the following pipeline

![Output Image](/images/routing.png?raw=true "Routing Example")

and consuming `range(4)` produces this output

```
extract processing 0
transform_even processing 0
extract processing 1
transform_odd processing 1
extract processing 2
transform_even processing 2
extract processing 3
transform_odd processing 3
```

### Merging
Up to this point, we have the ability to create processing trees where nodes can either broadcast to or route between their downstream nodes. We can, however, do more than this and create DAGs (directed acyclic graphs). Piping from a list back to a single node will merge the output of all nodes in the list into the single downstream node, like this.
```python
from consecution import Node, Pipeline


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


def parity(item):
    if item % 2 == 0:
        return 'transform_even'
    else:
        return 'transform_odd'


# piping from a list back to a single node merges items into the downstream node
pipe = Pipeline(
    LogNode('extract') | [
        LogNode('transform_even'), LogNode('transform_odd'), parity
    ] | LogNode('load')
)
pipe.plot()
pipe.consume(range(4))
```

The plot command produces the following pipeline

![Output Image](/images/dag.png?raw=true "DAG Example")

and consuming `range(4)` produces this output

```
extract processing 0
transform_even processing 0
load processing 0
extract processing 1
transform_odd processing 1
load processing 1
extract processing 2
transform_even processing 2
load processing 2
extract processing 3
transform_odd processing 3
load processing 3
```

### Managing Local State
Nodes are classes, and as such, you have the freedom to create any attribute you want on a node. You can actually define two additional methods on your nodes to set up and tear down node-local state. It is important to note the order of execution here. All nodes in a pipeline will execute their `.begin()` methods in pipeline order before any items are processed. Each node will enter its `.end()` method only after it has processed all items, and after all parent nodes have finished their respective `.end()` methods. Below, we've modified our LogNode to keep a running sum of all items that pass through it and end by printing their sum.
```python
from consecution import Node, Pipeline


class LogNode(Node):
    def begin(self):
        self.sum = 0
        print('{}.begin()'.format(self.name))

    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.sum += item
        self.push(item)

    def end(self):
        print('sum = {:d} in {}.end()'.format(self.sum, self.name))


# same routing function as in the merge example above
def parity(item):
    return 'transform_even' if item % 2 == 0 else 'transform_odd'


# Identical pipeline to the merge example above, but with the modified LogNode
pipe = Pipeline(
    LogNode('extract') | [
        LogNode('transform_even'), LogNode('transform_odd'), parity
    ] | LogNode('load')
)
pipe.consume(range(4))
```

Consuming `range(4)` produces the following output

```
extract.begin()
transform_even.begin()
transform_odd.begin()
load.begin()
extract processing 0
transform_even processing 0
load processing 0
extract processing 1
transform_odd processing 1
load processing 1
extract processing 2
transform_even processing 2
load processing 2
extract processing 3
transform_odd processing 3
load processing 3
sum = 6 in extract.end()
sum = 2 in transform_even.end()
sum = 4 in transform_odd.end()
sum = 6 in load.end()
```

### Managing Global State
Every node object has a `.global_state` attribute that is shared globally across all nodes in the pipeline. The attribute is also available on the Pipeline object itself. The GlobalState object is a simple mutable Python object whose attributes can be mutated by any node. It also remains accessible on the Pipeline object after all nodes have completed. Below is a simple example of mutating and accessing global state.
```python
from consecution import Node, Pipeline, GlobalState


class LogNode(Node):
    def process(self, item):
        self.global_state.messages.append(
            '{: >15} processing {}'.format(self.name, item)
        )
        self.push(item)


# create a global state object with a messages attribute
global_state = GlobalState(messages=[])

# Assign the predefined global_state to the pipeline
pipe = Pipeline(
    LogNode('extract') | LogNode('transform') | LogNode('load'),
    global_state=global_state
)

pipe.consume(range(3))

# print the contents of the global state message list
for msg in pipe.global_state.messages:
    print(msg)
```

Printing the contents of the messages list produces

```
extract processing 0
transform processing 0
load processing 0
extract processing 1
transform processing 1
load processing 1
extract processing 2
transform processing 2
load processing 2
```

## Common Patterns
This section shows examples of how to implement some common patterns in consecution.

### Map
Mapping with nodes is very simple. Just push an altered item downstream.

```python
from consecution import Node, Pipeline


class Mapper(Node):
    def process(self, item):
        self.push(2 * item)


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


pipe = Pipeline(
    LogNode('extractor') | Mapper('mapper') | LogNode('loader')
)
pipe.consume(range(3))
```

This will produce an output of

```
extractor processing 0
loader processing 0
extractor processing 1
loader processing 2
extractor processing 2
loader processing 4
```

### Reduce
Reducing, or folding, is easily implemented by using the `.begin()` and `.end()` methods to handle accumulated values.
```python
from consecution import Node, Pipeline


class Reducer(Node):
    def begin(self):
        self.result = 0

    def process(self, item):
        self.result += item

    def end(self):
        self.push(self.result)


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


pipe = Pipeline(
    LogNode('extractor') | Reducer('reducer') | LogNode('loader')
)
pipe.consume(range(3))
```

This will produce an output of

```
extractor processing 0
extractor processing 1
extractor processing 2
loader processing 3
```

### Filter
Filtering is as simple as placing the push statement behind a conditional. Items that don't pass the conditional are never pushed downstream, and are thus silently dropped.

```python
from consecution import Node, Pipeline


class Filter(Node):
    def process(self, item):
        if item > 3:
            self.push(item)


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


pipe = Pipeline(
    LogNode('extractor') | Filter('filter') | LogNode('loader')
)
pipe.consume(range(6))
```

This produces an output of

```
extractor processing 0
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
loader processing 4
extractor processing 5
loader processing 5
```

### Group By
Consecution provides a specialized class you can inherit from to perform grouping operations. GroupBy nodes must define two methods: `.key(item)` and `.process(batch)`. The `.key` method should return a key derived from an item that is used to identify groups. Any time that key changes, a new group is started. As with Python's `itertools.groupby`, you will usually want the GroupByNode to process sorted items. The `.process` method functions exactly like the `.process` method on regular nodes, except that instead of being called with single items, consecution will call it with a batch of items contained in a list.
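The key-change semantics just described mirror Python's `itertools.groupby`, which is also why sorted input usually matters. As a reference point only (this is plain standard-library code, not consecution), here is the same batching with key `item // 4`:

```python
from itertools import groupby

# Batch consecutive integers sharing the key item // 4, then sum each batch.
# groupby starts a new group every time the key value changes, which is
# exactly why unsorted input would fragment the groups.
items = range(16)
batch_sums = [
    sum(batch) for _, batch in groupby(items, key=lambda x: x // 4)
]
print(batch_sums)  # [6, 22, 38, 54]
```

This reproduces the sums that a `GroupByNode` with the same key pushes downstream in the next example.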
```python
from consecution import Node, GroupByNode, Pipeline


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


class Batcher(GroupByNode):
    def key(self, item):
        return item // 4

    def process(self, batch):
        sum_val = sum(batch)
        self.push(sum_val)


pipe = Pipeline(
    Batcher('batcher') | LogNode('logger')
)
pipe.consume(range(16))
```

This produces an output of

```
logger processing 6
logger processing 22
logger processing 38
logger processing 54
```

### Plugin-Style Composition
Consecution forces you to think about problems in terms of how small processing units are connected. This separation between logic and connectivity can be exploited to create flexible and reusable solutions. Basically, you specify the connectivity you want to use in solving your problem, and then plug in the processing units later. Breaking the problem up in this way allows you to swap out processing units to achieve different objectives with the same pipeline.

```python
# This function defines a pipeline that can use swappable processing nodes.
# We don't worry about how we are going to do logging or aggregating.
# We just focus on how the nodes are connected.
def pipeline_factory(log_node, agg_node):
    pipe = Pipeline(
        log_node('extractor') | agg_node('aggregator') | log_node('result_logger')
    )
    return pipe


# Now we define a node for left-justified logging
class LeftLogNode(Node):
    def process(self, item):
        print('{: <15} processing {}'.format(self.name, item))
        self.push(item)


# And one for right-justified logging
class RightLogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


# We can aggregate by summing
class SumNode(Node):
    def begin(self):
        self.result = 0

    def process(self, item):
        self.result += item

    def end(self):
        self.push(self.result)


# Or we can aggregate by multiplying
class ProdNode(Node):
    def begin(self):
        self.result = 1

    def process(self, item):
        self.result *= item

    def end(self):
        self.push(self.result)


# Now we plug in nodes to create a pipeline that left-prints sums
sum_pipeline = pipeline_factory(log_node=LeftLogNode, agg_node=SumNode)

# And a different pipeline that right-prints products
prod_pipeline = pipeline_factory(log_node=RightLogNode, agg_node=ProdNode)

print('aggregate with sum, left justified\n' + '-' * 40)
sum_pipeline.consume(range(1, 5))

print('\naggregate with product, right justified\n' + '-' * 40)
prod_pipeline.consume(range(1, 5))
```

This produces the following output

```
aggregate with sum, left justified
----------------------------------------
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
result_logger processing 10

aggregate with product, right justified
----------------------------------------
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
result_logger processing 24
```

# Aggregation Example
We end with a full-blown example of using a pipeline to aggregate data from a csv file. The data is contained in a csv file that looks like this.
gender | age | spent
--- | --- | ---
male | 11 | 39.39
female | 10 | 34.72
female | 15 | 40.02
male | 19 | 26.27
male | 13 | 21.22
female | 40 | 23.17
female | 52 | 33.42
male | 33 | 39.52
female | 16 | 28.65
male | 60 | 26.74

Although there are much simpler ways of solving this problem (e.g. with Pandashells), we deliberately construct a complex topology just to illustrate how to achieve complexity when it is actually needed. The diagram below was produced from the code beneath it. A quick glance at the diagram makes it obvious how the data is being routed through the system. The code is heavily commented to explain features of the consecution toolkit.

![Output Image](/images/gender_age.png?raw=true "Gender Age Pipeline")

```python
from __future__ import print_function
from collections import namedtuple
from pprint import pprint
import csv

from consecution import Node, Pipeline, GlobalState

# Named tuples are nice immutable containers
# for passing data between nodes
Person = namedtuple('Person', 'gender age spent')


# Create a pipeline that aggregates by gender and age.
# In creating the pipeline we focus on connectivity and don't
# worry about defining node behavior.
def pipe_factory(Extractor, Agg, gender_router, age_router):
    # Consecution provides a generic GlobalState class. Any object can be
    # used as the global_state in a pipeline, but the GlobalState object
    # provides a nice abstraction where attributes can be accessed either by
    # dot notation (e.g. global_state.my_attribute) or by dictionary notation
    # (e.g. global_state['my_attribute']). Furthermore, GlobalState objects
    # can be instantiated with initialized attributes using keyword arguments
    # as shown here.
    global_state = GlobalState(segment_totals={})

    # Notice, we haven't even defined the behavior of these nodes yet. They
    # will be defined later and are, for now, just passed into the factory
    # function as arguments while we focus on getting the topology right.
    pipe = Pipeline(
        Extractor('make_person') | [
            gender_router,
            (Agg('male') | [age_router, Agg('male_child'), Agg('male_adult')]),
            (Agg('female') | [age_router, Agg('female_child'), Agg('female_adult')]),
        ],
        global_state=global_state
    )

    # Nodes can be created outside of a pipeline definition
    adult = Agg('adult')
    child = Agg('child')
    total = Agg('total')

    # Sometimes the topology you want to create cannot easily be expressed
    # using the pipeline abstraction for wiring nodes together. You can
    # drop down to a lower level of abstraction by explicitly wiring nodes
    # together using the .add_downstream() method.
    adult.add_downstream(total)
    child.add_downstream(total)

    # Once a pipeline has been created, you can access individual nodes
    # with dictionary-like indexing on the pipeline.
    pipe['male_child'].add_downstream(child)
    pipe['female_child'].add_downstream(child)
    pipe['male_adult'].add_downstream(adult)
    pipe['female_adult'].add_downstream(adult)

    return pipe


# Now that we have the topology of our pipeline defined, we can think about
# the logic that needs to go into each node. We start by defining a node that
# takes a row from a csv file and transforms it into a namedtuple.
class MakePerson(Node):
    def process(self, item):
        item['age'] = int(item['age'])
        item['spent'] = float(item['spent'])
        self.push(Person(**item))


# We now define a node to perform our aggregations. Mutable global state
# comes with a lot of baggage and should be used with care. This node
# illustrates how to use global state to put all aggregations in a central
# location that remains accessible when the pipeline finishes processing.
class Sum(Node):
    def begin(self):
        # initialize the node-local sum to zero
        self.total = 0

    def process(self, item):
        # increment the node-local total and push the item downstream
        self.total += item.spent
        self.push(item)

    def end(self):
        # when the pipeline is done, update global state with the sum
        self.global_state.segment_totals[self.name] = round(self.total, 2)


# This function routes tuples based on their associated gender
def by_gender(item):
    return '{}'.format(item.gender)


# This function routes tuples based on whether the purchaser was an adult
# or a child
def by_age(item):
    if item.age >= 18:
        return '{}_adult'.format(item.gender)
    else:
        return '{}_child'.format(item.gender)


# Here we plug our node definitions into our topology to create a
# fully-defined pipeline.
pipe = pipe_factory(MakePerson, Sum, by_gender, by_age)

# We can now visualize the pipeline.
pipe.plot()

# Now we feed our pipeline with rows from the csv file
with open('sample_data.csv') as f:
    pipe.consume(csv.DictReader(f))

# The global_state is also available as an attribute on the pipeline,
# allowing us to access it when the pipeline is finished. This is a good way
# to "return" an object from a pipeline. Here we simply print the result.
print()
pprint(pipe.global_state.segment_totals)
```

And this is the result of running the pipeline with the sample csv file.

```
{'adult': 149.12,
 'child': 164.0,
 'female': 159.98,
 'female_adult': 56.59,
 'female_child': 103.39,
 'male': 153.14,
 'male_adult': 92.53,
 'male_child': 60.61,
 'total': 313.12}
```

As illustrated in the Pandashells example, this aggregation is actually much simpler to implement in Pandas. However, there are a couple of important caveats. The Pandas solution must load the entire csv file into memory at once. If you look at the pipeline solution, you will notice that each node simply increments its local sum and passes the data downstream. At no point is the data completely loaded into memory.
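The constant-memory behavior can be sketched independently of consecution: a running sum over a csv reader touches one row at a time. The snippet below uses a small in-memory string as a stand-in for `sample_data.csv` (the two rows shown are the first two from the table above):

```python
import csv
import io

# Stand-in for sample_data.csv; a real file object works the same way
data = io.StringIO("gender,age,spent\nmale,11,39.39\nfemale,10,34.72\n")

# csv.DictReader yields one row at a time, so only the running total is
# ever held in memory, no matter how large the file is.
total = 0.0
for row in csv.DictReader(data):
    total += float(row['spent'])

print(round(total, 2))  # 74.11
```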
Although the Pandas code runs much faster due to the highly optimized vectorized math it employs, the pipeline solution can process arbitrarily large csv files with a very small memory footprint.

Perhaps the most exciting aspect of consecution is its ability to create repeatable and testable data analysis pipelines. Passing Pandas DataFrames through a consecution pipeline makes it very easy to encapsulate any analysis into a well-defined, repeatable process where each node manipulates a dataframe in its prescribed way. Adopting this structure in analysis projects will undoubtedly ease the transition from analysis/research into production.

___
Projects by [robdmc](https://www.linkedin.com/in/robdecarvalho).
* [Pandashells](https://github.com/robdmc/pandashells) Pandas at the bash command line
* [Consecution](https://github.com/robdmc/consecution) Pipeline abstraction for Python
* [Behold](https://github.com/robdmc/behold) Helping debug large Python projects
* [Crontabs](https://github.com/robdmc/crontabs) Simple scheduling library for Python scripts
* [Switchenv](https://github.com/robdmc/switchenv) Manager for bash environments
* [Gistfinder](https://github.com/robdmc/gistfinder) Fuzzy-search your gists

================================================
FILE: consecution/__init__.py
================================================
# flake8: noqa
from consecution.nodes import Node, GroupByNode
from consecution.pipeline import Pipeline, GlobalState
from consecution.utils import Clock

__version__ = '0.2.0'

================================================
FILE: consecution/nodes.py
================================================
import sys
from collections import Counter, deque, OrderedDict
import traceback

from consecution.utils import Clock


class Node(object):
    """
    :type name: str
    :param name: The name of this node. Must be unique within a pipeline.

    :type kwargs: keyword args
    :param kwargs: Any additional keyword args are assigned as attributes on
        the node.
    You create nodes by inheriting from this class. You will be required to
    implement a `.process()` method on your class. You can call the `.push()`
    method from anywhere in your class implementation except from within the
    `.begin()` method.

    Note that although this documentation refers to "the `.push` method",
    `push` is actually a callable attribute assigned when nodes are placed
    into pipelines. Its signature is `.push(item)`, where `item` can be
    anything you want pushed to nodes connected to the downstream side of
    the node.
    """
    def __init__(self, name, **kwargs):
        # assign any user-defined attributes
        for k, v in kwargs.items():
            setattr(self, k, v)

        self.name = name
        self._upstream_nodes = []
        self._downstream_nodes = []
        self._num_top_down_calls = 0

        # node network can be visualized with pydot. These hold args and
        # kwargs that will be used to add and connect this node in the graph
        # visualization
        self._pydot_node_kwargs = dict(name=self.name, shape='rectangle')
        self._pydot_edge_kwarg_list = []

        self._router = None

        # this will be one of three values: None, 'input', 'output'
        self._logging = None

        # add a clock to allow for timing
        self.clock = Clock()

    def __str__(self):
        return 'N({})'.format(self.name)

    def __repr__(self):
        return self.__str__()

    def __hash__(self):
        """
        define __hash__ method. dicts and sets will use this as key
        """
        return id(self)

    def __eq__(self, other):
        return self.__hash__() == other.__hash__()

    def __lt__(self, other):
        """
        I need this to be able to sort by name
        """
        return self.name < other.name

    def __getitem__(self, key):
        msg = (
            '\n\nYou cannot call __getitem__ on nodes. You tried to call\n'
            '{self} [{key}]\n'
            'which doesn\'t make sense. You probably meant\n'
            '{self} | [{key}]\n'
        ).format(self=self, key=key)
        raise ValueError(msg)

    def _get_flattened_list(self, obj):
        if isinstance(obj, Node):
            return [obj]
        elif hasattr(obj, '__iter__'):
            nodes = []
            for el in obj:
                if isinstance(el, Node):
                    nodes.append(el)
                elif hasattr(el, '__iter__'):
                    nodes.extend(self._get_flattened_list(el))
            return nodes
        else:
            msg = (
                'Don\'t know what to do with {}. It\'s not a node, and '
                'it\'s not iterable.'
            ).format(repr(obj))
            raise ValueError(msg)

    def _get_exposed_slots(self, obj, pointing):
        nodes = set()
        for node in self._get_flattened_list(obj):
            if pointing == 'left':
                nodes = nodes.union(node.initial_node_set)
            elif pointing == 'right':
                nodes = nodes.union(node.terminal_node_set)
            else:
                raise ValueError('pointing must be "left" or "right"')
        return nodes

    def _connect_lefts_to_rights(self, lefts, rights, router=None):
        slots_from_left = self._get_exposed_slots(lefts, pointing='right')
        slots_from_right = self._get_exposed_slots(rights, pointing='left')
        for left in slots_from_left:
            router_node = None
            if router:
                router_name = '{}.{}'.format(
                    left.name, self._get_object_name(router))
                end_point_map = {n.name: n for n in slots_from_right}
                router_node = _RouterNode(
                    router_name, end_point_map, router)
                left.add_downstream(router_node)
            for right in slots_from_right:
                if router_node:
                    router_node.add_downstream(right)
                else:
                    left.add_downstream(right)

    def _get_object_name(self, obj):
        class_name = obj.__class__.__name__
        if class_name == 'function':
            return obj.__name__
        else:
            return class_name

    def _get_router(self, obj):
        router = None
        if hasattr(obj, '__iter__'):
            routers = [el for el in obj if hasattr(el, '__call__')]
            router = routers[0] if routers else None
        return router

    def __or__(self, other):
        router = self._get_router(other)
        self._connect_lefts_to_rights(self, other, router)
        return self

    def __ror__(self, other):
        self._connect_lefts_to_rights(other, self)
        return self

    @property
    def top_node(self):
        """
        This attribute always holds the top-most node in the node graph.
        Consecution only allows one top node.
        """
        root_nodes = self.root_nodes
        if len(root_nodes) > 1:
            msg = 'You must remove one of the following input nodes {}'.format(
                root_nodes)
            raise ValueError(msg)
        else:
            return root_nodes.pop()

    @property
    def terminal_node_set(self):
        """
        This attribute holds a set of all bottom nodes in the node graph.
        """
        return {
            node for node in self.depth_first_walk('down')
            if len(node._downstream_nodes) == 0
        }

    @property
    def initial_node_set(self):
        """
        When piecing together fragments of a graph, you can temporarily have
        connected nodes with multiple "top-nodes." This method returns this
        set of nodes. Note that consecution can only make pipelines from
        graphs having a single top node.
        """
        self.depth_first_walk('up')
        return {
            node for node in self.depth_first_walk('up')
            if len(node._upstream_nodes) == 0
        }

    @property
    def root_nodes(self):
        """
        This attribute holds a list of all nodes that do not have any
        upstream nodes attached.
        """
        return [
            node for node in self.all_nodes
            if len(node._upstream_nodes) == 0
        ]

    @property
    def all_nodes(self):
        """
        This attribute contains a set of all nodes in the graph.
        """
        return self.depth_first_walk('both')

    def log(self, what):
        """
        Calling this method on a node will turn on its logging feature. This
        means that the node will print logged items to the console. You can
        choose whether to log the inputs or outputs of a node.

        :type what: str
        :param what: One of 'input' or 'output' indicating whether you want
            to log the input or output of this node.
        """
        allowed = ['input', 'output']
        if what not in allowed:
            raise ValueError(
                '\'what\' argument must be in {}'.format(allowed)
            )
        self._logging = what

    def _get_downstream_reps(self):
        if self._downstream_nodes:
            downstreams = sorted([n.name for n in self._downstream_nodes])
            if len(downstreams) == 1:
                downstreams = downstreams[0]
            template = '{{: >{}s}} | {{}}\n'.format(
                self.pipeline._longest_node_name_len_)
            self.pipeline._node_repr += template.format(
                self.name, downstreams).replace('\'', '')

    def top_down_make_repr(self):
        """
        You should never need to use this method. It iterates through the
        node graph in top-down order making a repr string for each node.
        """
        if not hasattr(self, 'pipeline'):
            raise ValueError(
                'top_down_make_repr can only be called for nodes in a '
                'pipeline')
        self.pipeline._longest_node_name_len_ = max(
            len(n.name) for n in self.all_nodes)
        self.pipeline._node_repr = ''
        self.top_node.top_down_call('_get_downstream_reps')

    def top_down_call(self, method_name):
        """
        This utility method traverses the graph in top-down order and invokes
        the named method on every node it encounters. It is used internally
        to make sure the `.begin()` and `.end()` methods are not called
        before their upstream counterparts.

        :type method_name: str
        :param method_name: The name of the method you would like to call in
            top-down order.
        """
        # record the number of upstreams this node has
        num_upstreams = len(self._upstream_nodes)

        # if this node isn't pulling from multiple upstreams, it's ready
        # to recurse to downstreams
        if num_upstreams <= 1:
            ready_for_downstreams = True

        # this node isn't ready to recurse to downstreams until the current
        # call would be the last required call.
        elif self._num_top_down_calls == num_upstreams - 1:
            ready_for_downstreams = True
        else:
            ready_for_downstreams = False

        # if ready to recurse, then call the method on self and recurse
        # downwards.
        if ready_for_downstreams:
            getattr(self, method_name)()
            for downstream in self._downstream_nodes:
                downstream.top_down_call(method_name)
            self._num_top_down_calls = 0
        else:
            self._num_top_down_calls += 1

    def depth_first_walk(self, direction='both', as_ordered_list=False):
        """
        This method walks the graph of connected nodes in depth-first order.
        It uses a stack to emulate recursion. See good explanation at
        https://jeremykun.com/2013/01/22/depth-and-breadth-first-search/

        :type direction: str
        :param direction: one of 'up', 'down' or 'both' specifying the
            direction to walk.

        :type as_ordered_list: Bool
        :param as_ordered_list: If set to true, returns the walked nodes as
            an ordered list instead of an unordered set.

        :rtype: list or set
        :return: An iterable of the discovered nodes.
        """
        return self.walk(
            direction=direction, how='depth_first',
            as_ordered_list=as_ordered_list)

    def breadth_first_walk(self, direction='both', as_ordered_list=False):
        """
        This method walks the graph of connected nodes in breadth-first
        order. It uses a queue to emulate recursion. See good explanation at
        https://jeremykun.com/2013/01/22/depth-and-breadth-first-search/

        :type direction: str
        :param direction: one of 'up', 'down' or 'both' specifying the
            direction to walk.

        :type as_ordered_list: Bool
        :param as_ordered_list: If set to true, returns the walked nodes as
            an ordered list instead of an unordered set.

        :rtype: list or set
        :return: An iterable of the discovered nodes.
        """
        return self.walk(
            direction=direction, how='breadth_first',
            as_ordered_list=as_ordered_list)

    def walk(
            self, direction='both', how='breadth_first',
            as_ordered_list=False):
        """
        This is the core algorithm for walking a graph in a specified order.
        It is used by the `breadth_first_walk` and `depth_first_walk`
        methods.

        :type how: str
        :param how: one of 'breadth_first' or 'depth_first'

        :type direction: str
        :param direction: one of 'up', 'down' or 'both' specifying the
            direction to walk.
:type as_ordered_list: Bool :param as_ordered_list: If set to true, returns the walked nodes as an ordered list instead of an unordered set. :rtype: list or set :return: An iterable of the discovered nodes. """ if how not in {'depth_first', 'breadth_first'}: raise ValueError( '\'how\' argument must be one of ' '[\'depth_first\', \'breadth_first\']' ) # What I really want is an ordered set, which doesn't exist. So I'm # using the keys of an ordered dict to get the functionality I want. # I have no need for the values in this dict, only the keys. visited_nodes = OrderedDict() # holds nodes that still need to be explored queue = deque([self]) # while I still have nodes that need exploring while len(queue) > 0: # get the next node to explore node = queue.pop() # if I've already seen this node, nothing to do, so go to next if node in visited_nodes: continue # Make sure I don't visit this node # again. I'm using an ordered dict to mimic an ordered set. # I have no need for the value, so set it to None visited_nodes[node] = None neighbor_dict = { 'up': node._upstream_nodes, 'down': node._downstream_nodes, 'both': node._upstream_nodes + node._downstream_nodes, } if direction not in neighbor_dict: raise ValueError( 'direction must be \'up\', \'down\' or \'both\'') neighbors = neighbor_dict[direction] # search all neighbors to this node for unvisited nodes for node in neighbors: # if you find unvisited node, add it to nodes needing visit if node not in visited_nodes: if how == 'breadth_first': queue.appendleft(node) else: queue.append(node) # should have hit all nodes in the graph at this point if as_ordered_list: return list(visited_nodes.keys()) else: return set(visited_nodes.keys()) def _check_for_dups(self): counter = Counter() for node in self.all_nodes: counter.update({node.name: 1}) dups = [name for (name, count) in counter.items() if count > 1] if dups: msg = ( '\n\nNode names must be unique. Duplicates {} found.'
).format(list(dups)) raise ValueError(msg) return def _check_for_cycles(self): self_and_upstreams = self.depth_first_walk('up') downstreams = self.depth_first_walk('down') - {self} common_nodes = self_and_upstreams.intersection(downstreams) if common_nodes: raise ValueError('\n\nYour graph is not acyclic. It has loops.') def _validate_node(self, other): # only nodes allowed to be connected if not isinstance(other, Node): raise ValueError('Trying to connect a non-node type') def add_downstream(self, other): """ You will probably use this method quite a bit. It is used to manually attach a downstream node. :type other: consecution.Node :param other: An instance of the node you want to attach """ self._validate_node(other) self._downstream_nodes.append(other) other._upstream_nodes.append(self) self._check_for_dups() if self.name == other.name: raise ValueError('{} can\'t be downstream to itself'.format(self)) self._check_for_cycles() self._pydot_edge_kwarg_list.append( dict(tail_name=self.name, head_name=other.name)) def remove_downstream(self, other): """ This method removes the given node from being attached as a downstream node. 
:type other: consecution.Node :param other: An instance of the node you want to remove """ # remove self from the other's upstreams other._upstream_nodes = [ n for n in other._upstream_nodes if n.name != self.name] # remove other from self's downstream nodes self._downstream_nodes = [ n for n in self._downstream_nodes if n.name != other.name] # remove this connection from the pydot kwargs list new_kwargs_list = [] for kwargs in self._pydot_edge_kwarg_list: if kwargs['head_name'] == other.name: continue new_kwargs_list.append(kwargs) self._pydot_edge_kwarg_list = new_kwargs_list def _build_pydot_graph(self): """ This private method builds a pydot graph """ # define kwargs lists for creating the visualization (these are closure vars for function below) node_kwargs_list, edge_kwargs_list = [], [] # define a function to map over all nodes to aggregate viz kwargs def collect_kwargs(node): node_kwargs_list.append(node._pydot_node_kwargs) edge_kwargs_list.extend(node._pydot_edge_kwarg_list) for node in self.all_nodes: collect_kwargs(node) # doing import inside method so that graphviz dependency is optional from graphviz import Digraph # create a pydot graph graph = Digraph(comment='pipeline') # create pydot nodes for every node connected to this one for node_kwargs in node_kwargs_list: graph.node(**node_kwargs) # create pydot edges between all nodes connected to this one for edge_kwargs in edge_kwargs_list: graph.edge(**edge_kwargs) return graph def plot( self, file_name='pipeline', kind='png'): """ This method draws a visualization of your processing graph. You must have graphviz installed on your system for it to work properly. (See install instructions.) If you are running consecution in a Jupyter notebook, you can display an inline visualization of a pipeline by simply making the pipeline be the final expression in a cell.
:type file_name: str :param file_name: The name of the image file to generate :type kind: str :param kind: The kind of file to generate (png, pdf) """ graph = self._build_pydot_graph() # define allowed formats for saving the graph visualization ALLOWED_KINDS = {'pdf', 'png'} if kind not in ALLOWED_KINDS: raise ValueError('Only the following kinds are supported: {}'.format(ALLOWED_KINDS)) # set the output format graph.format = kind file_name = file_name.replace('.{}'.format(kind), '') # write the output file try: graph.render(file_name) except RuntimeError: sys.stderr.write( '\n\n' '=========================================================\n' 'Problem executing GraphViz. Make sure you have it\n' 'properly installed.\n' 'http://www.graphviz.org/\n' 'If you are on a mac, you should be able to install it with\n' 'brew install graphviz.\n\n' 'If you are on ubuntu, you can install it with\n' 'apt-get install graphviz\n' '=========================================================\n' '\n\n' ) raise def process(self, item): """ :type item: object :param item: The item this node should process You must override this method with your own logic. """ raise NotImplementedError( ( 'Error in node named {}\n' 'You must define a .process(self, item) method on all nodes' ).format(repr(self.name)) ) def reset(self): """ User can override this to do whatever logic they want. """ def _logged_process(self, item): if self._logging == 'input': self._write_log(item) self.process(item) def _begin(self): try: self.begin() except AttributeError: e = sys.exc_info()[1] tb = sys.exc_info()[2] ( code_file, line_no, method_name, line_txt ) = traceback.extract_tb(tb)[-1] msg = str(e) + ( '\n\nError in .begin() method of \'{}\' node.\n' 'Are you trying to call .push() from inside the\n' '.begin() method? 
That is not allowed.\n\n' 'file: {}, line {}\n--> {}\n\n' ).format(self.name, code_file, line_no, line_txt) traceback.print_exc() raise AttributeError(msg) def begin(self): pass def end(self): pass def _write_log(self, item): sys.stdout.write('node_log,{},{},{}\n'.format(self._logging, self.name, item)) def _push(self, item): """ This is the default pusher. It pushes to all downstreams. """ if self._logging == 'output': self._write_log(item) # The _process attribute will be set to the appropriate callable # when initializing the pipeline. I do this because I want the # chaining to be as efficient as possible. If logging is not set, # I don't want to have to hit that logic every push, so I just # invoke a callable attribute at each process that has been set # to the appropriate callable. for downstream in self._downstream_nodes: downstream._process(item) class _RouterNode(Node): """ This node will route to downstreams. The router function needs to return the name of the destination node. """ def __init__(self, name, end_point_map, route_callable): super(_RouterNode, self).__init__(name) self._end_point_map = end_point_map self._pydot_node_kwargs = dict(name=self.name, shape='oval') self._route_callable = route_callable def process(self, item): """ Route the item to the downstream node whose name is returned by the router callable. """ node = self._end_point_map.get(self._route_callable(item), None) if node is None: raise ValueError( ( '\n\nRouter node {} encountered bad route path {}. Valid ' 'route paths are {}.' ).format( self.name, repr(self._route_callable(item)), [n.name for n in self._downstream_nodes] ) ) node._process(item) class GroupByNode(Node): def __init__(self, *args, **kwargs): super(GroupByNode, self).__init__(*args, **kwargs) self._batch_ = [] self._previous_key = '__no_previous_key__' def key(self, item): """ You must define this method.
:type item: object :param item: The item you are processing :rtype: hashable object :return: a hashable object that serves as a key for the grouping process """ raise NotImplementedError( 'You must define a .key(self, item) method on all ' 'GroupBy nodes.' ) def process(self, batch): """ You must define this method. :type batch: iterable :param batch: A batch of items having the same key """ raise NotImplementedError( 'You must define a .process(self, batch) method on all GroupBy ' 'nodes.' ) def _process_item(self, item): key = self.key(item) if key != self._previous_key: self._previous_key = key if len(self._batch_) > 0: self.process(self._batch_) self._batch_ = [item] else: self._batch_.append(item) def _end(self): self.process(self._batch_) self._batch_ = [] def __getattribute__(self, name): """ This should trap for the end() method calls and install pre hook. """ if name == 'end': def wrapper(): self._end() return super(GroupByNode, self).__getattribute__(name)() return wrapper else: return super(GroupByNode, self).__getattribute__(name) ================================================ FILE: consecution/pipeline.py ================================================ import sys from consecution.nodes import GroupByNode class GlobalState(object): """ GlobalState is a simple container class that sets its attributes from constructor kwargs. It supports both object and dictionary access to its attributes. So, for example, all of the following statements are supported. .. code-block:: python from consecution import GlobalState global_state = GlobalState(a=1, b=2) global_state['c'] = 2 a = global_state['a'] An object of this class will be created as the default ``.global_state`` attribute on a Pipeline if you do not explicitly provide a global_state argument to the constructor. """ # I'm using unconventional "_item_self_" name here to avoid # conflicts when kwargs actually contain a "self" arg.
def __init__(_item_self, **kwargs): for key, val in kwargs.items(): _item_self[key] = val def __str__(_item_self): quoted_keys = [ '\'{}\''.format(k) for k in sorted(vars(_item_self).keys())] att_string = ', '.join(quoted_keys) return 'GlobalState({})'.format(att_string) def __repr__(_item_self): return _item_self.__str__() def __setitem__(_item_self, key, value): setattr(_item_self, key, value) def __getitem__(_item_self, key): return getattr(_item_self, key) class Pipeline(object): """ :type node: Node :param node: Any node in a connected graph :type global_state: object :param global_state: Any python object you want to use for holding global state. Once Nodes have been wired together, they must be placed in a pipeline in order to process data. If you would like to perform pipeline-level set-up and tear-down logic, you can subclass from Pipeline and override the ``.begin()`` and ``.end()`` methods. """ def __init__(self, node, global_state=None): # get a reference to the top node of the connected nodes supplied. self.top_node = node.top_node # set the pipeline global state if global_state: self.global_state = global_state else: self.global_state = GlobalState() # initialize an empty lookup for nodes self._node_lookup = {} # initialize the pipeline self.initialize() def initialize(self, with_push=False): # define a flag to determine if the pipeline is "running" or not # it will only be true between when the .begin() is run and the # .end() method is run.
self._is_running = False self._needs_log_header = False # initialize each node for node in self.top_node.all_nodes: self.initialize_node(node, with_push) # build the pipeline repr by cycling through all the nodes self.top_node.top_down_make_repr() # print a logging header if any node is logging if self._needs_log_header: sys.stdout.write('node_log,what,node_name,item\n') def initialize_node(self, node, with_push=False): # give node reference to pipeline attributes node.pipeline = self node.global_state = self.global_state # make node available for lookup self._node_lookup[node.name] = node # set the _process callable to be either logged or unlogged # TODO: might want to change this logic so that groupby nodes # can be logged if isinstance(node, GroupByNode): node._process = node._process_item elif node._logging is None: node._process = node.process else: self._needs_log_header = True node._process = node._logged_process # for single downstreams with no logging, can short-circuit all logic # and directly wire up the downstream process() callable as the # push callable on this node short_it = len(node._downstream_nodes) == 1 short_it = short_it and node._downstream_nodes[0]._logging is None short_it = short_it and not isinstance( node._downstream_nodes[0], GroupByNode) # only initialize push if requested if with_push: if short_it and node._logging is None: node.push = node._downstream_nodes[0].process # logged or multiple downstreams require logic, so no short circuit else: node.push = node._push def __getitem__(self, name): node = self._node_lookup.get(name, None) if node is None: raise KeyError('No node named \'{}\''.format(name)) return node def __setitem__(self, name_to_replace, replacement_node): # make sure replacement node has proper name if name_to_replace != replacement_node.name: raise ValueError( 'Replacement node must have the same name.'
) # this will automatically raise an error if the name doesn't exist node_to_replace = self[name_to_replace] removals = [] additions = [] for upstream in node_to_replace._upstream_nodes: removals.append((upstream, node_to_replace)) additions.append((upstream, replacement_node)) # handle special case of upstream being a routing node if hasattr(upstream, '_end_point_map'): upstream._end_point_map[name_to_replace] = replacement_node for downstream in node_to_replace._downstream_nodes: removals.append((node_to_replace, downstream)) additions.append((replacement_node, downstream)) for upstream, downstream in removals: upstream.remove_downstream(downstream) for upstream, downstream in additions: upstream.add_downstream(downstream) # initialize the replacement node within the pipeline self.initialize_node(replacement_node) # if top node was replaced then make sure pipeline knows about it if replacement_node.name == self.top_node.name: self.top_node = replacement_node def __getattribute__(self, name): """ This should trap for the begin() and end() method calls and install pre/post hooks for when they are called either on the pipeline class or on any class derived from it. """ if name == 'begin': def wrapper(): super(Pipeline, self).__getattribute__(name)() self._begin() return wrapper elif name == 'end': def wrapper(): self._end() return super(Pipeline, self).__getattribute__(name)() return wrapper elif name == 'reset': def wrapper(): self._reset() return super(Pipeline, self).__getattribute__(name)() return wrapper else: return super(Pipeline, self).__getattribute__(name) def begin(self): """ Override this method to execute any logic you want to perform before setting up nodes. The ``.begin()`` method of all nodes will be called. """ def end(self): """ Override this method to execute any logic you want to perform after all nodes are done processing data. The ``.end()`` method of all nodes will be called.
""" def reset(self): """ Override this with any logic you'd like to perform for resetting the pipeline. The ``.reset()`` method of all nodes will be called. """ def _reset(self): self.top_node.top_down_call('reset') def _begin(self): self.top_node.top_down_call('_begin') self.initialize(with_push=True) self._is_running = True def _end(self): self.top_node.top_down_call('end') self._is_running = False def push(self, item): """ You can manually push items to your pipeline using this meethod. :type item: object :param item: Any object you would like the pipeline to process """ if not self._is_running: self.begin() self.top_node._process(item) def consume(self, iterable): """ The pipeline will process each item in the iterable. :type iterable: A Python Iterable :param iterable: An iterable of objects you would like to process """ self.begin() for item in iterable: self.top_node._process(item) return self.end() def plot(self, file_name='pipeline', kind='png'): """ Call this method to produce a visualization of your pipeline. The Graphviz library will be used to generate the image file. Note that pipelines are automatically visualized in IPython notebook when they are evaluated as the last expression in a cell. :type file_name: str :param file_name: The name of the image file to save :type kind: str :param kind: The type of image file to produce (png, pdf) """ self.top_node.plot(file_name, kind) return self def __str__(self): return ( '\nPipeline\n' '----------------------------------' '----------------------------------\n{}' '----------------------------------' '----------------------------------\n' ).format(self._node_repr) def __repr__(self): return self.__str__() # No good way to test this unless you know dot is installed. 
def _repr_svg_(self): # pragma: no cover return self.top_node._build_pydot_graph()._repr_svg_() ================================================ FILE: consecution/tests/__init__.py ================================================ ================================================ FILE: consecution/tests/nodes_tests.py ================================================ import os from collections import namedtuple import shutil import tempfile from unittest import TestCase import subprocess from mock import patch from consecution.nodes import Node def dot_installed(): p = subprocess.Popen( ['bash', '-c', 'which dot'], stdout=subprocess.PIPE) p.wait() result = p.stdout.read().decode("utf-8") return 'dot' in result class FakeDigraph(object): # pragma: no cover def __init__(self, *args, **kwargs): pass def node(self, *args, **kwargs): pass def edge(self, *args, **kwargs): pass def render(self, *args, **kwargs): raise RuntimeError('fake runtime error') class NodeUnitTests(TestCase): def test_bad_logging_args(self): n = Node('a') with self.assertRaises(ValueError): n.log('bad') def test_bad_top_down_make_repr_call(self): n = Node('a') with self.assertRaises(ValueError): n.top_down_make_repr() def test_args_as_atts(self): n = Node('my_node', silly_attribute='silly') self.assertEqual(n.silly_attribute, 'silly') def test_comparisons(self): a = Node('a') b = Node('b') self.assertTrue(a == a) self.assertFalse(a == b) self.assertTrue(a < b) self.assertFalse(b < a) def test_bad_flattening(self): a = Node('a') with self.assertRaises(ValueError): a | 7 @patch( 'consecution.nodes.Node._build_pydot_graph', lambda a: FakeDigraph()) def test_graphviz_not_installed(self): a = Node('a') b = Node('b') p = a | b with self.assertRaises(RuntimeError): p.plot() def test_no_getitem(self): a = Node('a') with self.assertRaises(ValueError): a['b'] def test_bad_slot_name(self): a = Node('a') b = Node('b') with self.assertRaises(ValueError): a._get_exposed_slots(b, 'bad_arg') class 
ExplicitWiringTests(TestCase): def setUp(self): self.temp_dir = tempfile.mkdtemp() def tearDown(self): shutil.rmtree(self.temp_dir) def do_wiring(self): self.do_explicit_wiring() def do_explicit_wiring(self): # define nodes a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') g = Node('g') h = Node('h') i = Node('i') j = Node('j') k = Node('k') l = Node('l') # noqa. okay to use l as var here m = Node('m') n = Node('n') # save a list of all nodes self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n] self.top_node = a # wire up the nodes a.add_downstream(b) a.add_downstream(c) c.add_downstream(d) c.add_downstream(e) e.add_downstream(f) e.add_downstream(g) e.add_downstream(h) e.add_downstream(i) f.add_downstream(j) g.add_downstream(j) h.add_downstream(j) i.add_downstream(j) d.add_downstream(k) j.add_downstream(k) b.add_downstream(l) k.add_downstream(l) l.add_downstream(m) l.add_downstream(n) # same network in graph notation # a | [ # b, # c | [ # d, # e | [f, g, h, i, my_router] | j # ] | k # ] | l [m, n] def do_graph_wiring(self): # define nodes a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') g = Node('g') h = Node('h') i = Node('i') j = Node('j') k = Node('k') l = Node('l') # noqa. 
okay to use l as var here m = Node('m') n = Node('n') # save a list of all nodes self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n] self.top_node = a a | [ # noqa b, c | [ d, e | [f, g, h, i] | j ] | k ] | l | [m, n] def test_connections(self): Conns = namedtuple('Conns', 'node upstreams downstreams') self.do_wiring() n = { node.name: Conns( node.name, {u.name for u in node._upstream_nodes}, {d.name for d in node._downstream_nodes} ) for node in self.node_list } self.assertEqual(n['a'].upstreams, set()) self.assertEqual(n['a'].downstreams, {'b', 'c'}) self.assertEqual(n['b'].upstreams, {'a'}) self.assertEqual(n['b'].downstreams, {'l'}) self.assertEqual(n['c'].upstreams, {'a'}) self.assertEqual(n['c'].downstreams, {'d', 'e'}) self.assertEqual(n['e'].upstreams, {'c'}) self.assertEqual(n['e'].downstreams, {'f', 'g', 'h', 'i'}) self.assertEqual(n['f'].upstreams, {'e'}) self.assertEqual(n['f'].downstreams, {'j'}) self.assertEqual(n['g'].upstreams, {'e'}) self.assertEqual(n['g'].downstreams, {'j'}) self.assertEqual(n['h'].upstreams, {'e'}) self.assertEqual(n['h'].downstreams, {'j'}) self.assertEqual(n['i'].upstreams, {'e'}) self.assertEqual(n['i'].downstreams, {'j'}) self.assertEqual(n['d'].upstreams, {'c'}) self.assertEqual(n['d'].downstreams, {'k'}) self.assertEqual(n['j'].upstreams, {'f', 'g', 'h', 'i'}) self.assertEqual(n['j'].downstreams, {'k'}) self.assertEqual(n['k'].upstreams, {'j', 'd'}) self.assertEqual(n['k'].downstreams, {'l'}) self.assertEqual(n['l'].upstreams, {'k', 'b'}) self.assertEqual(n['l'].downstreams, {'m', 'n'}) def test_all_nodes(self): self.do_wiring() expected_set = set(self.node_list) all_nodes_set = [ set(node.all_nodes) for node in self.node_list ] self.assertTrue(all( [expected_set == found_set for found_set in all_nodes_set])) def test_top_node(self): self.do_wiring() top_node_set = {node.top_node for node in self.node_list} self.assertEqual(top_node_set, {self.top_node}) def test_duplicate_node(self): self.do_wiring() # this test 
is funky in that it has assertion in a loop. # but I wanted to be sure cycles are detected everywhere for name in [n.name for n in self.top_node.all_nodes]: dup = Node(name) with self.assertRaises(ValueError): self.top_node.add_downstream(dup) def test_acyclic(self): self.do_wiring() # this test is funky in that it has assertion in a loop. # but I wanted to be sure dups are detected everywhere for node in self.top_node.all_nodes: with self.assertRaises(ValueError): node.add_downstream(self.top_node) def test_multi_root(self): self.do_wiring() other_root = Node('dual_root') other_root.add_downstream(self.top_node._downstream_nodes[0]) with self.assertRaises(ValueError): other_root.top_node def test_non_node_connect(self): node = Node('a') other = 'not a node' with self.assertRaises(ValueError): node.add_downstream(other) def test_write(self): # don't run coverage on this because won't test travis with # both dot installed and not installed. if dot_installed(): # pragma: no cover self.do_wiring() out_file = os.path.join(self.temp_dir, 'out.png') self.top_node.plot(out_file) # uncomment the next line if you want to look at the graph os.system('cp {} /tmp'.format(out_file)) def test_write_bad_kind(self): self.do_wiring() with self.assertRaises(ValueError): self.top_node.plot(kind='bad') def test_bad_search_direction(self): self.do_wiring() with self.assertRaises(ValueError): self.top_node.breadth_first_walk(direction='bad') def test_bad_search_method(self): self.do_wiring() with self.assertRaises(ValueError): self.top_node.walk(how='bad') class DSLWiringTests(ExplicitWiringTests): def do_wiring(self): self.do_graph_wiring() class TopDownCallTests(TestCase): def test_call_order_okay(self): # a toy class that holds a class variable # tracking what order objects get called in class MyNode(Node): call_list = [] def end(self): self.__class__.call_list.append(self) a = MyNode('a') b = MyNode('b') c = MyNode('c') d = MyNode('d') e = MyNode('e') f = MyNode('f') g = MyNode('g') 
a | [ b | c, d | e | f ] | g a.top_node.top_down_call('end') # make a dictionary with order in which nodes # were called call_number = { node: ind for (ind, node) in enumerate(a.__class__.call_list)} # make sure ording of one branch is right self.assertTrue(call_number[a] < call_number[b]) self.assertTrue(call_number[b] < call_number[c]) self.assertTrue(call_number[c] < call_number[g]) # make sure ordering of other branch is okay self.assertTrue(call_number[a] < call_number[d]) self.assertTrue(call_number[d] < call_number[e]) self.assertTrue(call_number[e] < call_number[f]) self.assertTrue(call_number[f] < call_number[g]) class BreadthFirstSearchTests(TestCase): def test_top_down_order(self): a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') h = Node('h') i = Node('i') def silly_router(item): # pragma: no cover return 0 a | [b, c] | [d, e, f, silly_router] | [h, i] nodes = a.top_node.breadth_first_walk( direction='down', as_ordered_list=True) level5 = {nodes.pop() for nn in range(2)} level4 = {nodes.pop() for nn in range(3)} level3 = {nodes.pop() for nn in range(2)} level2 = {nodes.pop() for nn in range(2)} level1 = {nodes.pop() for nn in range(1)} self.assertEqual(level1, {a}) self.assertEqual(level2, {b, c}) self.assertEqual(len(level3), 2) self.assertEqual(level4, {d, e, f}) self.assertEqual(level5, {h, i}) def test_bottom_up_order(self): a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') h = Node('h') def silly_router(item): # pragma: no cover return 0 a | [b, c] | [d, e, f, silly_router] | h nodes = h.breadth_first_walk(direction='up', as_ordered_list=True) nodes = nodes[::-1] level5 = {nodes.pop() for nn in range(1)} level4 = {nodes.pop() for nn in range(3)} level3 = {nodes.pop() for nn in range(2)} level2 = {nodes.pop() for nn in range(2)} level1 = {nodes.pop() for nn in range(1)} self.assertEqual(level1, {a}) self.assertEqual(level2, {b, c}) self.assertEqual(len(level3), 2) 
self.assertEqual(level4, {d, e, f}) self.assertEqual(level5, {h}) class PrintingTests(TestCase): def setUp(self): # define nodes a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') g = Node('g') h = Node('h') i = Node('i') j = Node('j') k = Node('k') l = Node('l') # noqa okay to use l here m = Node('m') n = Node('n') class DummyPipeline(object): pass pipeline = DummyPipeline() # save a list of all nodes self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n] self.top_node = a def my_router(item): # pragma: no cover return 'm' # wire up nodes using dsl a | [ b, # noqa c | [ d, e | [f, g, h, i] | j ] | k ] | l | [m, n, my_router] for node in self.top_node.all_nodes: node.pipeline = pipeline def test_nothing(self): self.top_node.top_down_make_repr() lines = sorted([ line.strip() for line in self.top_node.pipeline._node_repr.split('\n') if line.strip() ]) expected_lines = sorted([ 'a | [b, c]', 'b | l', 'c | [d, e]', 'd | k', 'e | [f, g, h, i]', 'f | j', 'g | j', 'h | j', 'i | j', 'j | k', 'k | l', 'l | l.my_router', 'l.my_router | [m, n]', ]) self.assertEqual(lines, expected_lines) class RoutingTests(TestCase): def test_nothing(self): a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') def silly_router(item): # pragma: no cover return 0 class ClassRouter(object): # pragma: no cover def __call__(self, arg): return arg a | [b, c, ClassRouter()] | [d, e, silly_router] ================================================ FILE: consecution/tests/pipeline_tests.py ================================================ from __future__ import print_function from collections import namedtuple, Counter from unittest import TestCase from consecution.nodes import Node, GroupByNode from consecution.pipeline import Pipeline, GlobalState from consecution.tests.testing_helpers import print_catcher Item = namedtuple('Item', 'value parent source') class Item(object): # pragma: no cover (just a testing helper) def __init__(self, value, parent, 
source): self.value = value self.parent = parent self.source = source def build_source_list(self, source_list=None): source_list = [] if source_list is None else source_list source_list.append(self.source) if self.parent: self.parent.build_source_list(source_list) return source_list def get_path_string(self): return '|'.join([str(self.value)] + self.build_source_list()[::-1]) def __str__(self): return self.get_path_string() def __repr__(self): return self.get_path_string() class TestNode(Node): def process(self, item): self.push( Item(value=item.value, parent=item, source=self.name) ) class ResultNode(Node): def process(self, item): self.global_state.final_items.append(item) class BadNode(Node): def begin(self): self.push(1) def process(self, item): # pragma: no cover this should never get hit. self.push(item) def item_generator(): for ind in range(1, 3): yield Item( value=ind, parent=None, source='generator' ) class TestBase(TestCase): def setUp(self): a = TestNode('a') b = TestNode('b') c = TestNode('c') d = TestNode('d') even = TestNode('even') odd = TestNode('odd') g = TestNode('g') def even_odd(item): return ['even', 'odd'][item.value % 2] a | b | [c, d] | [even, odd, even_odd] | g self.pipeline = Pipeline(a, global_state=GlobalState(final_items=[])) class GlobalStateUnitTests(TestCase): def test_kwargs_passed(self): g = GlobalState(custom_name='custom') p = Pipeline(TestNode('a'), global_state=g) self.assertTrue(p.global_state.custom_name == 'custom') self.assertTrue(p.global_state['custom_name'] == 'custom') def test_printing(self): g = GlobalState(custom_name='custom') with print_catcher() as catcher1: print(g) with print_catcher() as catcher2: print(repr(g)) self.assertTrue( 'GlobalState(\'custom_name\')' in catcher1.txt) self.assertTrue( 'GlobalState(\'custom_name\')' in catcher2.txt) class OrOpTests(TestCase): def test_ror(self): a = Node('a') b = Node('b') c = Node('c') d = Node('d') p = Pipeline(a | ([b, c] | d)) with print_catcher() as catcher: 
print(p) self.assertTrue('a | [b, c]' in catcher.txt) self.assertTrue('c | d' in catcher.txt) self.assertTrue('b | d' in catcher.txt) class ManualFeedTests(TestCase): def test_manual_feed(self): class N(Node): def begin(self): self.global_state.out_list = [] def process(self, item): self.global_state.out_list.append(item) pipeline = Pipeline(TestNode('a') | N('b')) pushed_list = [] for item in item_generator(): pushed_list.append(item) pipeline.push(item) pipeline.end() self.assertEqual(len(pipeline.global_state.out_list), 2) class PipelineUnitTests(TestCase): def test_push_in_begin(self): pipeline = Pipeline(BadNode('a') | TestNode('b')) with self.assertRaises(AttributeError): pipeline.begin() def test_no_process(self): class N(Node): pass pipe = Pipeline(N('a') | N('b')) with self.assertRaises(NotImplementedError): pipe.consume(range(3)) def test_bad_route(self): def bad_router(item): return 'bad' class N(Node): def process(self, item): self.push(item) pipeline = Pipeline(N('a') | [N('b'), N('c'), bad_router]) with self.assertRaises(ValueError): pipeline.consume(range(3)) def test_bad_node_lookup(self): pipeline = Pipeline(TestNode('a') | TestNode('b')) with self.assertRaises(KeyError): pipeline['c'] def test_bad_replacement_name(self): pipeline = Pipeline(TestNode('a') | TestNode('b')) with self.assertRaises(ValueError): pipeline['b'] = TestNode('c') def test_flattened_list(self): pipeline = Pipeline( TestNode('a') | [[Node('b'), Node('c')]]) with print_catcher() as catcher: print(pipeline) self.assertTrue('a | [b, c]' in catcher.txt) def test_logging(self): pipeline = Pipeline(TestNode('a') | TestNode('b')) pipeline['a'].log('output') pipeline['b'].log('input') with print_catcher() as catcher: pipeline.consume(item_generator()) text = """ node_log,what,node_name,item node_log,output,a,1|generator|a node_log,input,b,1|generator|a node_log,output,a,2|generator|a node_log,input,b,2|generator|a """ for line in text.split('\n'): self.assertTrue(line.strip() in 
catcher.txt) def test_reset(self): class N(Node): def begin(self): self.was_reset = False def process(self, item): self.push(item) def reset(self): self.was_reset = True pipe = Pipeline(N('a') | N('b')) pipe.consume(range(3)) self.assertFalse(pipe['a'].was_reset) self.assertFalse(pipe['b'].was_reset) pipe.reset() self.assertTrue(pipe['a'].was_reset) self.assertTrue(pipe['b'].was_reset) class LoggingTests(TestBase): def test_logging(self): self.pipeline['g'].log('input') with print_catcher() as printer: self.pipeline.consume(item_generator()) counter = Counter() for line in printer.lines(): even_odd = line.split('|')[-1] counter.update({even_odd: 1}) self.assertEqual(counter['even'], 2) self.assertEqual(counter['odd'], 2) class ReplacementTests(TestBase): def test_replace_first(self): class Replacement(Node): def process(self, item): self.push( Item(value=10 * item.value, parent=item, source=self.name) ) self.pipeline['a'] = Replacement('a') self.pipeline['a'].log('output') with print_catcher() as printer: self.pipeline.consume(item_generator()) self.assertEqual(printer.txt.count('10'), 1) self.assertEqual(printer.txt.count('20'), 1) def test_replace_even(self): class Replacement(Node): def process(self, item): self.push( Item(value=10 * item.value, parent=item, source=self.name) ) self.pipeline['even'] = Replacement('even') self.pipeline['g'].log('output') with print_catcher() as printer: self.pipeline.consume(item_generator()) self.assertEqual(printer.txt.count('1'), 2) self.assertEqual(printer.txt.count('20'), 2) def test_replace_no_router(self): a = TestNode('a') b = TestNode('b') pipe = Pipeline(a | b) pipe['b'] = TestNode('b') with print_catcher() as catcher: print(pipe) self.assertTrue('a | b' in catcher.txt) class ConsumingTests(TestBase): def test_even_odd(self): self.pipeline['g'].add_downstream( ResultNode('result_node') ) self.pipeline.consume(item_generator()) expected_path_set = set([ '1|generator|a|b|c|odd|g', '1|generator|a|b|d|odd|g', 
'2|generator|a|b|c|even|g', '2|generator|a|b|d|even|g', ]) path_set = set( item.get_path_string() for item in self.pipeline.global_state.final_items ) self.assertEqual(expected_path_set, path_set) class ConstructingTests(TestBase): def test_printing(self): lines = repr(self.pipeline).split('\n') self.assertEqual(len(lines), 13) def test_plotting(self): # don't want to force a mock dependency, so make a simple mock here args_kwargs = [] def return_calls(*args, **kwargs): args_kwargs.append(args) args_kwargs.append(kwargs) # assign my mock to the top node plot function self.pipeline.top_node.plot = return_calls # call pipeline plot self.pipeline.plot() # make sure top node plot was properly called self.assertEqual(args_kwargs[0], ('pipeline', 'png')) self.assertEqual(args_kwargs[1], {}) class Batch(GroupByNode): def begin(self): self.global_state.batches = [] def key(self, item): return item // 3 def process(self, batch): self.global_state.batches.append(batch) class GroupByTests(TestCase): def test_batching(self): pipe = Pipeline(Batch('a')) pipe.consume(range(9)) self.assertEqual( pipe.global_state.batches, [[0, 1, 2], [3, 4, 5], [6, 7, 8]] ) def test_undefined_key(self): class B(GroupByNode): def process(self, item): # pragma: no cover pass pipe = Pipeline(B('a')) with self.assertRaises(NotImplementedError): pipe.consume(range(9)) def test_undefined_process(self): class B(GroupByNode): def key(self, item): pass pipe = Pipeline(B('a')) with self.assertRaises(NotImplementedError): pipe.consume(range(9)) ================================================ FILE: consecution/tests/testing_helpers.py ================================================ import sys from contextlib import contextmanager # These don't need to be covered.
They are just testing utilities @contextmanager def print_catcher(buff='stdout'): # pragma: no cover if buff == 'stdout': sys.stdout = Printer() yield sys.stdout sys.stdout = sys.__stdout__ elif buff == 'stderr': sys.stderr = Printer() yield sys.stderr sys.stderr = sys.__stderr__ else: # pragma: no cover This is just to help testing. No need to cover. raise ValueError('buff must be either \'stdout\' or \'stderr\'') class Printer(object): # pragma: no cover def __init__(self): self.txt = "" def write(self, txt): self.txt += txt def lines(self): for line in self.txt.split('\n'): yield line.strip() ================================================ FILE: consecution/tests/utils_tests.py ================================================ from __future__ import print_function from unittest import TestCase from consecution.utils import Clock import time from consecution.tests.testing_helpers import print_catcher class ClockTests(TestCase): def test_bad_start(self): clock = Clock() with self.assertRaises(ValueError): clock.start() def test_printing(self): clock = Clock() with clock.running('a', 'b', 'c'): with clock.paused('a'): time.sleep(.1) with clock.paused('b'): time.sleep(.1) with print_catcher() as printer: print(repr(clock)) names = [] for ind, line in enumerate(printer.txt.split('\n')): if line: if ind > 0: names.append(line.split()[-1]) self.assertEqual(names, ['c', 'b', 'a']) def test_get_time_of_running(self): clock = Clock() with clock.running('a'): time.sleep(.1) delta1 = int(10 * clock.get_time()) time.sleep(.1) delta2 = int(10 * clock.get_time()) self.assertEqual(delta1, 1) self.assertEqual(delta2, 2) def test_pausing(self): clock = Clock() with clock.running('a', 'b', 'c'): time.sleep(.1) with clock.paused('b', 'c'): time.sleep(.1) self.assertEqual(int(10 * clock.get_time('a')), 2) self.assertEqual(int(10 * clock.get_time('b')), 1) self.assertEqual(int(10 * clock.get_time('c')), 1) self.assertEqual( {int(10 * v) for v in clock.get_time().values()}, {1, 2} )
def test_stop_all(self): clock = Clock() clock.start('a', 'b') time.sleep(.1) clock.stop() self.assertEqual(int(10 * clock.get_time('a')), 1) self.assertEqual(int(10 * clock.get_time('b')), 1) def test_reset_all(self): clock = Clock() clock.start('a', 'b') time.sleep(.1) clock.stop('b') self.assertEqual(len(clock.delta), 1) clock.reset() self.assertEqual(len(clock.get_time()), 0) def test_double_calls(self): clock = Clock() clock.start('a') clock.start('a') time.sleep(.1) clock.stop('a') clock.stop('a') self.assertEqual(int(round(10 * clock.get_time())), 1) clock.reset('a') clock.reset('a') clock.reset('b') clock.reset('b') self.assertEqual(clock.get_time(), {}) def test_get_time_delta_only(self): clock = Clock() clock.start('a') clock.stop('a') self.assertEqual(clock.get_time('f'), {}) ================================================ FILE: consecution/utils.py ================================================ from collections import Counter from contextlib import contextmanager import datetime class Clock(object): def __init__(self): # see the reset method for instance attributes self.delta = Counter() self.active_start_times = dict() @contextmanager def running(self, *names): self.start(*names) yield self.stop(*names) @contextmanager def paused(self, *names): self.stop(*names) yield self.start(*names) def start(self, *names): if not names: raise ValueError('You must provide at least one name to start') for name in names: if name not in self.active_start_times: self.active_start_times[name] = datetime.datetime.now() def stop(self, *names): ending = datetime.datetime.now() if not names: names = list(self.active_start_times.keys()) for name in names: if name in self.active_start_times: starting = self.active_start_times.pop(name) self.delta.update({name: (ending - starting).total_seconds()}) def reset(self, *names): if not names: names = list(self.active_start_times.keys()) names.extend(list(self.delta.keys())) for name in names: if name in self.delta: 
self.delta.pop(name) if name in self.active_start_times: self.active_start_times.pop(name) def get_time(self, *names): ending = datetime.datetime.now() if not names: names = list(self.delta.keys()) names.extend(list(self.active_start_times.keys())) delta = Counter() for name in names: if name in self.delta: delta.update({name: self.delta[name]}) elif name in self.active_start_times: delta.update( { name: ( ending - self.active_start_times[name] ).total_seconds() } ) if len(delta) == 1: return delta[list(delta.keys())[0]] else: return dict(delta) def __str__(self): records = sorted(self.delta.items(), key=lambda t: t[1], reverse=True) records = [('%0.6f' % r[1], r[0]) for r in records] out_list = ['{: <15s}{}'.format('seconds', 'name')] for rec in records: out_list.append('{: <15s}{}'.format(*rec)) return '\n'.join(out_list) def __repr__(self): return self.__str__() ================================================ FILE: docker/Dockerfile ================================================ FROM ubuntu:xenial # root is the home directory WORKDIR /root ADD simple_example.py /root/simple_example.py # set up the system tools including conda RUN \ rm /bin/sh && ln -s /bin/bash /bin/sh && \ apt-get update && \ apt-get install -y vim && \ apt-get install -y git && \ apt-get install -y wget && \ apt-get install -y curl && \ apt-get install -y graphviz && \ apt-get install -y python-dev RUN \ curl -sS https://bootstrap.pypa.io/get-pip.py | python RUN \ pip install git+https://github.com/robdmc/consecution.git ================================================ FILE: docker/docker_build.sh ================================================ #! /usr/bin/env bash docker build . -t consecution ================================================ FILE: docker/docker_run.sh ================================================ #! 
/usr/bin/env bash docker run -it --rm -v $(pwd):/root/shared consecution /bin/bash ================================================ FILE: docker/simple_example.py ================================================ #! /usr/bin/env python # TODO: make the consecution install in the docker file read from pip from __future__ import print_function from consecution import Node, Pipeline class N(Node): def process(self, item): print(item, self.name) self.push(item) p = Pipeline( N('a') | [N('b'), N('c')] | N('d') ) p.plot() p.consume(range(5)) ================================================ FILE: docs/Makefile ================================================ # Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # User-friendly check for sphinx-build ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) endif # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp epub latex latexpdf text man changes linkcheck doctest gettext help: @echo "Please use \`make <target>' where <target> is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " xml to make Docutils-native XML files" @echo " pseudoxml to make pseudoxml-XML files for display purposes" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -W -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files."
htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." latexpdfja: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through platex and dvipdfmx..." $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 
changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." xml: $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml @echo @echo "Build finished. The XML files are in $(BUILDDIR)/xml." pseudoxml: $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml @echo @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." ================================================ FILE: docs/conf.py ================================================ # -*- coding: utf-8 -*- # import inspect import os import re def get_version(): """Obtain the package version from a python file e.g. pkg/__init__.py See . """ file_dir = os.path.realpath(os.path.dirname(__file__)) with open( os.path.join(file_dir, '..', 'consecution', '__init__.py')) as f: txt = f.read() version_match = re.search( r"""^__version__ = ['"]([^'"]*)['"]""", txt, re.M) if version_match: return version_match.group(1) raise RuntimeError("Unable to find version string.") # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. #sys.path.insert(0, os.path.abspath('.')) # -- General configuration ------------------------------------------------ extensions = [ 'sphinx.ext.autodoc', 'sphinx.ext.intersphinx', #'sphinx.ext.viewcode', ] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames.
source_suffix = '.rst' # The master toctree document. master_doc = 'toc' # General information about the project. project = 'consecution' copyright = '2017, Rob deCarvalho' # The short X.Y version. version = get_version() # The full version, including alpha/beta/rc tags. release = version exclude_patterns = ['_build'] # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' intersphinx_mapping = { 'python': ('http://docs.python.org/3.4', None), 'django': ('http://django.readthedocs.org/en/latest/', None), #'celery': ('http://celery.readthedocs.org/en/latest/', None), } # -- Options for HTML output ---------------------------------------------- html_theme = 'default' #html_theme_path = [] on_rtd = os.environ.get('READTHEDOCS', None) == 'True' if not on_rtd: # only import and set the theme if we're building docs locally import sphinx_rtd_theme html_theme = 'sphinx_rtd_theme' html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". # html_static_path = ['_static'] # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. html_show_sphinx = False # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. html_show_copyright = True # Output file base name for HTML help builder. htmlhelp_basename = 'consecutiondoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). 
#'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ ('index', 'consecution.tex', 'consecution Documentation', 'Rob deCarvalho', 'manual'), ] # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ ('index', 'consecution', 'consecution Documentation', ['Rob deCarvalho'], 1) ] # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ('index', 'consecution', 'consecution Documentation', 'Rob deCarvalho', 'consecution', 'A short description', 'Miscellaneous'), ] def process_django_model_docstring(app, what, name, obj, options, lines): """ Does special processing for django model docstrings, making docs for fields in the model. 
""" # This causes import errors if left outside the function from django.db import models # Only look at objects that inherit from Django's base model class if inspect.isclass(obj) and issubclass(obj, models.Model): # Grab the field list from the meta class fields = obj._meta.fields for field in fields: # Decode and strip any html out of the field's help text help_text = strip_tags(force_unicode(field.help_text)) # Decode and capitalize the verbose name, for use if there isn't # any help text verbose_name = force_unicode(field.verbose_name).capitalize() if help_text: # Add the model field to the end of the docstring as a param # using the help text as the description lines.append(':param %s: %s' % (field.attname, help_text)) else: # Add the model field to the end of the docstring as a param # using the verbose name as the description lines.append(':param %s: %s' % (field.attname, verbose_name)) # Add the field's type to the docstring lines.append(':type %s: %s' % (field.attname, type(field).__name__)) # Return the extended docstring return lines def setup(app): # Register the docstring processor with sphinx app.connect('autodoc-process-docstring', process_django_model_docstring) ================================================ FILE: docs/index.rst ================================================ Overview ============================= Consecution is: * An easy-to-use pipeline abstraction inspired by `Apache Storm Topologies `_. * Designed to simplify building ETL pipelines that are robust and easy to test * A system for wiring together simple processing nodes to form a DAG, which is fed with a python iterable * Built using synchronous, single-threaded execution strategies designed to run efficiently on a single core * Implemented in pure-python with optional requirements that are needed only for graph visualization * Written with 100% test coverage See the `Github project page `_. for examples of how to use `consecution`. 
================================================ FILE: docs/ref/consecution.rst ================================================ .. _ref-consecution: API documentation ================== Node ---- Nodes are the fundamental processing unit in consecution. A node is created by inheriting from the `consecution.Node` class. You are free to declare as many attributes and methods on a node class as you wish. You should not override the constructor unless you really know what you're doing. Instead, any initialization you wish to perform can be carried out in the `.begin()` method. In the descriptions below, it is assumed that the nodes being discussed have been wired together into a pipeline and are ready to consume items. See the `Github README `_ for examples of how to wire nodes into pipelines. Reserved Method Names ~~~~~~~~~~~~~~~~~~~~~ The following Node methods are not intended to be overridden, so you should not define methods with these names in your node implementations unless you really know what you are doing. * `top_node` * `initial_node_set` * `terminal_node_set` * `root_nodes` * `all_nodes` * `log` * `top_down_make_repr` * `top_down_call` * `depth_first_search` * `breadth_first_search` * `search` * `add_downstream` * `remove_downstream` * `plot` There are also a number of private method names you should avoid. These can be identified by looking at the `source code `_ Examples ~~~~~~~~ Here is the simplest possible node you could construct: .. code-block:: python from consecution import Node class MyNode(Node): def process(self, item): self.push(item) All nodes acquire a `.push()` method when they are wired into a pipeline. You can call this method anywhere in your class except in the `.begin()` method. The `.push(item)` method will take its argument and send it to the `.process()` methods of the nodes that are immediately downstream in your pipeline graph. Here is an example node defining all methods you can override. 
The functionality of each method is explained in the code comments. .. code-block:: python from consecution import Node class MyNode(Node): def begin(self): # This sets up whatever state you want to exist before the # node begins processing any data. You can think of it as an # init method that runs just before the node starts processing. # In this example, we initialize a simple counter self.counter = 0 def process(self, item): # This is the method that defines the processing you want to perform # on every item the node processes. You can place whatever logic # you want here, including calls to the .push() method. # In this example, we update the counter and push the item # downstream. self.counter += 1 self.push(item) def end(self): # This method is called right after all items are processed. # This happens when the iterator being consumed by the pipeline # is exhausted. At that point the .end() methods of all nodes # in the pipeline are called. This is a good place for you to # push any summary information downstream. # In this example we push the results of our counter self.push(self.counter) def reset(self): # A pipeline can be reused and reset back to its initial condition. # It does this by calling the .reset() method of all its member # nodes. You can place whatever code you want here to reset your # node to its initial state. # In this example, we simply reset the counter. self.counter = 0 Node API Documentation ~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: consecution.nodes.Node :members: GroupBy Node ~~~~~~~~~~~~~~~~~~~~~~ Consecution provides a special Node class specifically designed to do grouping. It works in much the same way as Python's built-in ``itertools.groupby`` function. It expects to process items in key-sorted order. In addition to the ``.process()`` method required of all nodes, you must also define a ``.key()`` method that will extract a key from each item being processed. See the Github project page for an example of using ``GroupByNode``. ..
autoclass:: consecution.nodes.GroupByNode :members: Manually Connecting Nodes ------------------------- The Node base class is equipped with an ``.add_downstream(other_node)`` method. This method provides detailed control over how nodes are wired together. It simply adds ``other_node`` as a downstream relation. Here is an example of creating a pipeline with one top node that broadcasts items to two downstream nodes, and then collects their results into a single output node. .. code-block:: python from __future__ import print_function from consecution import Pipeline, Node class SimpleNode(Node): def process(self, item): print('{} processing {}'.format(self.name, item)) self.push(item) top = SimpleNode('top') left = SimpleNode('left') right = SimpleNode('right') output = SimpleNode('output') top.add_downstream(left) top.add_downstream(right) left.add_downstream(output) right.add_downstream(output) pipe = Pipeline(top) pipe.consume(range(2)) Node Connection Mini-language ----------------------------- Consecution provides a concise domain-specific language (DSL) for creating directed acyclic graphs. This is the preferred method for connecting nodes into a pipeline. However, you may occasionally find that your desired topology is not easy to express in the DSL. For these situations, consecution provides a lower-level escape hatch that allows you to manually connect two nodes together. These two levels of abstraction provide a very powerful interface for constructing complex pipelines. The DSL is inspired by the unix syntax for chaining together the inputs and outputs of different programs at the bash prompt. You use the pipe symbol ``|`` to connect nodes together. These pipe operators will always return one of the node objects in your connected topology. Below is an example of creating a simple linear pipeline. ..
code-block:: python from __future__ import print_function from consecution import Pipeline, Node class SimpleNode(Node): def process(self, item): print('{} processing {}'.format(self.name, item)) self.push(item) left = SimpleNode('left') middle = SimpleNode('middle') right = SimpleNode('right') # wire nodes together with bash-like pipe operator node_object = left | middle | right # You can now pass the node object into a pipeline constructor pipe = Pipeline(node_object) pipe.consume(range(2)) In order to create a directed acyclic graph (DAG) you need four basic constructs: * Send data from one node to a single other node * Broadcast data from one node to a set of other nodes * Route data from one node to one of a set of other nodes * Gather output from several nodes into one node. The DSL provides mechanisms for each of these constructs, and we will look at each in turn. Send data from single node to single node ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use simple bash-like pipe syntax to send data from a single node to another node. .. code-block:: python # Send data from one node to a single other node using bash-like piping. node1 | node2 Broadcast data from single node to multiple nodes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Broadcasting is accomplished by piping to a list of nodes. In the following example, ``node1`` will send each item it pushes to ``node2``, ``node3``, and ``node4``. .. code-block:: python # Broadcast to a set of nodes by piping to a list node1 | [node2, node3, node4] Routing from one node to one of multiple nodes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Routing is accomplished by piping to a list that contains a single callable and any number of nodes. The following example will send even numbers to ``even_node`` and odd numbers to ``odd_node``. .. code-block:: python # Define a node class class N(Node): def process(self, item): self.push(item) # Define a routing function.
    # It takes a single argument, the item you pushed, and should
    # return a string with the name of the node to which that item
    # should be routed.
    def route_func(item):
        if item % 2 == 0:
            return 'even_node'
        else:
            return 'odd_node'

    # Pipe to a list of nodes and a callable to achieve routing
    N('top_node') | [N('even_node'), N('odd_node'), route_func]

Gather output from multiple nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gathering output from a set of nodes is as simple as piping a list of nodes
(and possibly a route function) to a single node.  In this example, the
outputs of ``node2``, ``node3``, and ``node4`` will all be sent to
``node5``.

.. code-block:: python

    # Gather the output of a set of nodes by piping the list to a single node
    node1 | [node2, node3, node4] | node5

Pipeline
--------

Once nodes are wired together, they need to be encapsulated into a pipeline
before they can operate on data.  This is done by passing any node in the
network as the argument to the ``Pipeline`` constructor.  On construction,
the pipeline will ensure you have a valid processing graph and will execute
initialization code to ensure that the nodes are efficiently connected.
Immediately after construction, the pipeline is ready to consume data.

Consuming Iterables
~~~~~~~~~~~~~~~~~~~

When the ``.consume(iterable)`` method is called, a sequence of events
occurs in exactly this order.

#. The ``.begin()`` method on the pipeline object is called.  You can
   override this method to perform any task you'd like.
#. The ``.begin()`` methods of all nodes in the network are called.  They
   are called in top-down order, which means that the ``.begin()`` method of
   a node is guaranteed not to be called until the ``.begin()`` methods of
   all its ancestors have been called.
#. Items are read from the iterable argument supplied to the ``.consume()``
   method.  These are fed through the topology of the processing graph one
   by one.  Each item is completely processed by the graph before the next
   one is lifted off the iterable.
#. The ``.end()`` methods of all nodes are called in top-down order.
#. The ``.end()`` method of the pipeline is called.

Manually Feeding a Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to consuming iterables, you can manually feed pipelines using
the ``.push()`` method on the pipeline itself.  When you are finished
pushing items, you can manually call the ``.end()`` method.  Here is an
example.

.. code-block:: python

    from __future__ import print_function
    from consecution import Node, Pipeline

    class N(Node):
        def process(self, item):
            print(item)
            self.push(item)

    pipe = Pipeline(N('first') | N('second'))
    for nn in range(2):
        pipe.push(nn)
    pipe.end()

Pipeline API Documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~

Pipelines support dictionary-like access to their nodes.  Here are examples.

.. code-block:: python

    from consecution import Node, Pipeline

    # Define a node
    class N(Node):
        def process(self, item):
            self.push(item)

    # Create a pipeline with two nodes
    pipe = Pipeline(N('first') | N('second'))

    # Get a reference to a node with dictionary syntax
    first = pipe['first']

    # Replace a node with dictionary-like syntax
    pipe['first'] = N('first')

.. autoclass:: consecution.pipeline.Pipeline
   :members:

GlobalState
-----------

The ``GlobalState`` class is a simple python class that supports both
dictionary-like and object-like attribute access.  An object of this class
will be used as the default ``global_state`` attribute of a pipeline if you
don't explicitly provide one in the constructor.

.. autoclass:: consecution.pipeline.GlobalState
   :members:

================================================
FILE: docs/toc.rst
================================================
Table of Contents
=================

.. toctree::
    :maxdepth: 2

    index
    ref/consecution

================================================
FILE: pandashells.md
================================================
Pandashells One-liner Example
===

Pandashells lets you use Pandas from the bash command line.
It allows you to combine unix command-line tools (awk, grep, sed, etc.) with
the power of Pandas DataFrames and Matplotlib visualization.  Here is a
one-liner that performs the exact same aggregation demonstrated by the
example consecution pipeline.

```bash
cat sample_data.csv | \
p.df 'df["group"] = ["adult" if a>=18 else "child" for a in df.age]' | \
p.df 'df.pivot_table(index="group", columns="gender", values="spent", margins=True, aggfunc=sum).fillna(0)' \
-o table index
```

================================================
FILE: publish.py
================================================
import subprocess

subprocess.call('pip install wheel'.split())
subprocess.call('python setup.py clean --all'.split())
subprocess.call('python setup.py sdist'.split())
# subprocess.call('pip wheel --no-index --no-deps --wheel-dir dist dist/*.tar.gz'.split())
subprocess.call('python setup.py register sdist bdist_wheel upload'.split())

================================================
FILE: sample_data.csv
================================================
gender,age,spent
male,11,39.39
female,10,34.72
female,15,40.02
male,19,26.27
male,13,21.22
female,40,23.17
female,52,33.42
male,33,39.52
female,16,28.65
male,60,26.74

================================================
FILE: setup.cfg
================================================
[nosetests]
nocapture=1
verbosity=1
with-coverage=1
cover-branches=1
#cover-min-percentage=100
cover-package=consecution

[coverage:report]
show_missing=True
fail_under=100
exclude_lines =
    # Have to re-enable the standard pragma
    pragma: no cover
    # Don't complain if tests don't hit defensive assertion code:
    raise NotImplementedError

[coverage:run]
omit =
    consecution/version.py
    consecution/__init__.py

[flake8]
max-line-length = 120
exclude = docs,env,*.egg
max-complexity = 10
ignore = E402

[build_sphinx]
source-dir = docs/
build-dir = docs/_build
all_files = 1

[upload_sphinx]
upload-dir = docs/_build/html

[bdist_wheel]
universal = 1
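For readers who don't have pandashells installed, the aggregation performed by the one-liner in `pandashells.md` above can be reproduced with nothing but the Python standard library. The sketch below is purely illustrative (it inlines the sample rows and skips the pivot-table layout); it is not code from this repository.

```python
import csv
import io
from collections import defaultdict

# Inline copy of sample_data.csv (from this repository)
SAMPLE = """gender,age,spent
male,11,39.39
female,10,34.72
female,15,40.02
male,19,26.27
male,13,21.22
female,40,23.17
female,52,33.42
male,33,39.52
female,16,28.65
male,60,26.74
"""


def aggregate(text):
    """Sum 'spent' grouped by (adult/child, gender)."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(text)):
        group = 'adult' if int(row['age']) >= 18 else 'child'
        totals[(group, row['gender'])] += float(row['spent'])
    return totals


totals = aggregate(SAMPLE)
```

The resulting sums match the cells of the pivot table the one-liner prints (e.g. adult/male is the sum of the three adult male `spent` values).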
================================================
FILE: setup.py
================================================
#!/usr/bin/env python
import io
import os
import re

from setuptools import setup, find_packages

file_dir = os.path.dirname(__file__)


def read(path, encoding='utf-8'):
    path = os.path.join(os.path.dirname(__file__), path)
    with io.open(path, encoding=encoding) as fp:
        return fp.read()


def version(path):
    """Obtain the package version from a python file e.g. pkg/__init__.py

    See .
    """
    version_file = read(path)
    version_match = re.search(
        r"""^__version__ = ['"]([^'"]*)['"]""", version_file, re.M)
    if version_match:
        return version_match.group(1)
    raise RuntimeError("Unable to find version string.")


LONG_DESCRIPTION = """
Consecution is an easy-to-use pipeline abstraction inspired by Apache Storm
topologies.
"""

setup(
    name='consecution',
    version=version(os.path.join(file_dir, 'consecution', '__init__.py')),
    author='Rob deCarvalho',
    author_email='unlisted',
    description=('Pipeline Abstraction Library'),
    license='BSD',
    keywords=('pipeline apache storm DAG graph topology ETL'),
    url='https://github.com/robdmc/consecution',
    packages=find_packages(),
    long_description=LONG_DESCRIPTION,
    classifiers=[
        'Environment :: Console',
        'Intended Audience :: Developers',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3.5',
        'Topic :: Scientific/Engineering',
    ],
    extras_require={'dev': ['nose', 'coverage', 'mock', 'flake8', 'coveralls']},
    install_requires=['graphviz']
)
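The `version()` helper in setup.py extracts `__version__` with a multiline regex instead of importing the package (which would fail before its dependencies are installed). Here is a minimal self-contained sketch of that same pattern; the module body is hypothetical, not taken from the repository.

```python
import re

# Hypothetical contents of a package __init__.py
module_text = '''
"""Example package."""
__version__ = '1.2.3'
'''

# The same regex setup.py uses: anchor to the start of a line (re.M)
# and capture whatever sits between the quotes.
match = re.search(r"""^__version__ = ['"]([^'"]*)['"]""", module_text, re.M)
version = match.group(1) if match else None
```

Reading the version as text keeps `setup.py` importable even when the package's own `install_requires` (here, `graphviz`) are not yet present.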