Repository: robdmc/consecution Branch: develop Commit: c23b4ea20fb7 Files: 29 Total size: 119.8 KB Directory structure: gitextract_eotr679u/ ├── .coveragerc ├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── consecution/ │ ├── .coverage │ ├── __init__.py │ ├── nodes.py │ ├── pipeline.py │ ├── tests/ │ │ ├── __init__.py │ │ ├── nodes_tests.py │ │ ├── pipeline_tests.py │ │ ├── testing_helpers.py │ │ └── utils_tests.py │ └── utils.py ├── docker/ │ ├── Dockerfile │ ├── docker_build.sh │ ├── docker_run.sh │ └── simple_example.py ├── docs/ │ ├── Makefile │ ├── conf.py │ ├── index.rst │ ├── ref/ │ │ └── consecution.rst │ └── toc.rst ├── pandashells.md ├── publish.py ├── sample_data.csv ├── setup.cfg └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .coveragerc ================================================ [report] show_missing = True ================================================ FILE: .gitignore ================================================ .DS_Store *.pyc ================================================ FILE: .travis.yml ================================================ sudo: false language: python python: - '2.7' - '3.4' - '3.5' - '3.6' - '3.7' install: - pip install -e .[dev] before_script: - flake8 . script: - nosetests - coverage report --fail-under=100 after_success: - coveralls notifications: email: false addons: apt: packages: - graphviz ================================================ FILE: LICENSE ================================================ Copyright (c) 2015, Robert deCarvalho All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the FreeBSD Project. ================================================ FILE: README.md ================================================ Update (2/23/2021) === It looks like this README is slowly turning into a reference of all the projects in this space that I think are better than consecution. Here is [metaflow](https://github.com/Netflix/metaflow), an offering from Netflix. Update (9/21/2020) === Another library that I believe to be better than consecution is the [pypeln](https://cgarciae.github.io/pypeln/) project. The way it allows for a different number of workers on each node of a pipeline is quite nice. Additionally the ability to control whether each node is run using threads, processes, async, or sync is really useful. 
Update (5/1/2020)
===
Since writing this, the excellent [streamz](https://streamz.readthedocs.io/en/latest/) package has been created. Streamz is the project I wish had existed back when I wrote this. It is a much more capable implementation of the core ideas of consecution, and plays nicely with [dask](https://dask.org/) to achieve scale. I have started using streamz in my work in place of consecution.

Consecution
===
[![Build Status](https://travis-ci.org/robdmc/consecution.svg?branch=develop)](https://travis-ci.org/robdmc/consecution)
[![Coverage Status](https://coveralls.io/repos/github/robdmc/consecution/badge.svg?branch=develop)](https://coveralls.io/github/robdmc/consecution?branch=add_docs)

Introduction
---
Consecution is:

* An easy-to-use pipeline abstraction inspired by Apache Storm topologies
* Designed to simplify building ETL pipelines that are robust and easy to test
* A system for wiring together simple processing nodes to form a DAG, which is fed with a Python iterable
* Built using synchronous, single-threaded execution strategies designed to run efficiently on a single core
* Implemented in pure Python, with optional requirements that are needed only for graph visualization
* Written with 100% test coverage

Consecution makes it easy to build systems like this.

![Output Image](/images/etl_example.png?raw=true "ETL Example")

Installation
---
Consecution is a pure-Python package that is simply installed with pip. The only optional system requirement is the Graphviz package, which is needed only if you want to create graphical representations of your pipeline.
```bash
[~]$ pip install consecution
```
Docker
---
If you would like to try out consecution on Docker, check out consecution from GitHub and navigate to the `docker/` subdirectory. From there, run the following.

* Build the consecution image: `docker_build.sh`
* Start a container: `docker_run.sh`
* Once in the container, run the example: `python simple_example.py`

Quick Start
---
What follows is a quick tour of consecution. See the API documentation for more detailed information.

### Nodes
Consecution works by wiring together nodes. You create nodes by inheriting from the `consecution.Node` class. Every node must define a `.process()` method. This method contains whatever logic you want for processing single items as they pass through your pipeline. Here is an example of a node that simply logs items passing through it.

```python
from consecution import Node


class LogNode(Node):
    def process(self, item):
        # any logic you want for processing a single item
        print('{: >15} processing {}'.format(self.name, item))
        # send item downstream
        self.push(item)
```

### Pipelines
Now let's create a pipeline that wires together a series of these logging nodes. We do this by employing the pipe symbol in much the same way that you pipe data between programs in unix. Note that you must name nodes when you instantiate them.

```python
from consecution import Node, Pipeline


# This is the same node class we defined above
class LogNode(Node):
    def process(self, item):
        print('{} processing {}'.format(self.name, item))
        self.push(item)


# Connect nodes with pipe symbols to create a pipeline for consuming any iterable.
pipe = Pipeline(
    LogNode('extract') | LogNode('transform') | LogNode('load')
)
```

At this point, we can visualize the pipeline to verify that the topology is what we expect it to be. If you have Graphviz installed, you can now simply type one of the following to see the pipeline visualized.
```python
# Create a pipeline.png file in your working directory
pipe.plot()

# Interactively display the pipeline visualization in an IPython notebook
# by simply making the final expression in a cell evaluate to a pipeline.
pipe
```

The plot command should produce the following visualization.

![Output Image](/images/etl1.png?raw=true "Three Node ETL Example")

If you don't have Graphviz installed, you can print the pipeline object to get a text-based visualization.

```python
print(pipe)
```

This represents your pipeline as a series of pipe statements showing how data is piped between nodes.

```
Pipeline
--------------------------------------------------------------------
  extract | transform
transform | load
--------------------------------------------------------------------
```

We can now process an iterable with our pipeline by running

```python
pipe.consume(range(3))
```

which will print the following to the console.

```
extract processing 0
transform processing 0
load processing 0
extract processing 1
transform processing 1
load processing 1
extract processing 2
transform processing 2
load processing 2
```

### Broadcasting
Piping the output of a single node into a list of nodes will cause the single node to broadcast its pushed items to every node in the list.
So, again, using our logging node, we could construct a pipeline like this:

```python
from consecution import Node, Pipeline


class LogNode(Node):
    def process(self, item):
        print('{} processing {}'.format(self.name, item))
        self.push(item)


# pipe to a list of nodes to broadcast items
pipe = Pipeline(
    LogNode('extract') | LogNode('transform') | [
        LogNode('load_redis'), LogNode('load_postgres'), LogNode('load_mongo')
    ]
)
pipe.plot()
pipe.consume(range(2))
```

The plot command produces this visualization

![Output Image](/images/broadcast.png?raw=true "Broadcast Example")

and consuming `range(2)` produces this output

```
extract processing 0
transform processing 0
load_redis processing 0
load_postgres processing 0
load_mongo processing 0
extract processing 1
transform processing 1
load_redis processing 1
load_postgres processing 1
load_mongo processing 1
```

### Routing
If you pipe to a list that contains multiple nodes and a single callable, then consecution will interpret the callable as a routing function that accepts a single item as its only argument and returns the name of one of the nodes in the list. The routing function will direct the flow of items as illustrated below.
```python
from consecution import Node, Pipeline


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


def parity(item):
    if item % 2 == 0:
        return 'transform_even'
    else:
        return 'transform_odd'


# pipe to a list containing a callable to achieve routing behaviour
pipe = Pipeline(
    LogNode('extract') | [
        LogNode('transform_even'), LogNode('transform_odd'), parity
    ]
)
pipe.plot()
pipe.consume(range(4))
```

The plot command produces the following pipeline

![Output Image](/images/routing.png?raw=true "Routing Example")

and consuming `range(4)` produces this output

```
extract processing 0
transform_even processing 0
extract processing 1
transform_odd processing 1
extract processing 2
transform_even processing 2
extract processing 3
transform_odd processing 3
```

### Merging
Up to this point, we have the ability to create processing trees where nodes can either broadcast to or route between their downstream nodes. We can, however, do more than this and create DAGs (directed acyclic graphs). Piping from a list back to a single node will merge the output of all nodes in the list into the single downstream node, like this.
```python
from consecution import Node, Pipeline


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


def parity(item):
    if item % 2 == 0:
        return 'transform_even'
    else:
        return 'transform_odd'


# piping from a list back to a single node merges items into the downstream node
pipe = Pipeline(
    LogNode('extract') | [
        LogNode('transform_even'), LogNode('transform_odd'), parity
    ] | LogNode('load')
)
pipe.plot()
pipe.consume(range(4))
```

The plot command produces the following pipeline

![Output Image](/images/dag.png?raw=true "DAG Example")

and consuming `range(4)` produces this output

```
extract processing 0
transform_even processing 0
load processing 0
extract processing 1
transform_odd processing 1
load processing 1
extract processing 2
transform_even processing 2
load processing 2
extract processing 3
transform_odd processing 3
load processing 3
```

### Managing Local State
Nodes are classes, and as such, you have the freedom to create any attribute you want on a node. You can actually define two additional methods on your nodes to set up and tear down node-local state. It is important to note the order of execution here. All nodes in a pipeline will execute their `.begin()` methods in pipeline order before any items are processed. Each node will enter its `.end()` method only after it has processed all items, and after all parent nodes have finished their respective `.end()` methods. Below, we've modified our LogNode to keep a running sum of all items that pass through it and end by printing their sum.
```python
from consecution import Node, Pipeline


class LogNode(Node):
    def begin(self):
        self.sum = 0
        print('{}.begin()'.format(self.name))

    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.sum += item
        self.push(item)

    def end(self):
        print('sum = {:d} in {}.end()'.format(self.sum, self.name))


# same routing function as in the merge example above
def parity(item):
    return 'transform_even' if item % 2 == 0 else 'transform_odd'


# Identical pipeline to the merge example above, but with the modified LogNode
pipe = Pipeline(
    LogNode('extract') | [
        LogNode('transform_even'), LogNode('transform_odd'), parity
    ] | LogNode('load')
)
pipe.consume(range(4))
```

Consuming `range(4)` produces the following output

```
extract.begin()
transform_even.begin()
transform_odd.begin()
load.begin()
extract processing 0
transform_even processing 0
load processing 0
extract processing 1
transform_odd processing 1
load processing 1
extract processing 2
transform_even processing 2
load processing 2
extract processing 3
transform_odd processing 3
load processing 3
sum = 6 in extract.end()
sum = 2 in transform_even.end()
sum = 4 in transform_odd.end()
sum = 6 in load.end()
```

### Managing Global State
Every node object has a `.global_state` attribute that is shared globally across all nodes in the pipeline. The attribute is also available on the Pipeline object itself. The GlobalState object is a simple mutable Python object whose attributes can be mutated by any node. It also remains accessible on the Pipeline object after all nodes have completed. Below is a simple example of mutating and accessing global state.
```python
from consecution import Node, Pipeline, GlobalState


class LogNode(Node):
    def process(self, item):
        self.global_state.messages.append(
            '{: >15} processing {}'.format(self.name, item)
        )
        self.push(item)


# create a global state object with a messages attribute
global_state = GlobalState(messages=[])

# Assign the predefined global_state to the pipeline
pipe = Pipeline(
    LogNode('extract') | LogNode('transform') | LogNode('load'),
    global_state=global_state
)

pipe.consume(range(3))

# print the contents of the global state message list
for msg in pipe.global_state.messages:
    print(msg)
```

Printing the contents of the messages list produces

```
extract processing 0
transform processing 0
load processing 0
extract processing 1
transform processing 1
load processing 1
extract processing 2
transform processing 2
load processing 2
```

## Common Patterns
This section shows examples of how to implement some common patterns in consecution.

### Map
Mapping with nodes is very simple. Just push an altered item downstream.

```python
from consecution import Node, Pipeline


class Mapper(Node):
    def process(self, item):
        self.push(2 * item)


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


pipe = Pipeline(
    LogNode('extractor') | Mapper('mapper') | LogNode('loader')
)
pipe.consume(range(3))
```

This will produce an output of

```
extractor processing 0
loader processing 0
extractor processing 1
loader processing 2
extractor processing 2
loader processing 4
```

### Reduce
Reducing, or folding, is easily implemented by using the `.begin()` and `.end()` methods to handle accumulated values.
```python
from consecution import Node, Pipeline


class Reducer(Node):
    def begin(self):
        self.result = 0

    def process(self, item):
        self.result += item

    def end(self):
        self.push(self.result)


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


pipe = Pipeline(
    LogNode('extractor') | Reducer('reducer') | LogNode('loader')
)
pipe.consume(range(3))
```

This will produce an output of

```
extractor processing 0
extractor processing 1
extractor processing 2
loader processing 3
```

### Filter
Filtering is as simple as placing the push statement behind a conditional. Items that don't pass the conditional are never pushed downstream, and are thus silently dropped.

```python
from consecution import Node, Pipeline


class Filter(Node):
    def process(self, item):
        if item > 3:
            self.push(item)


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


pipe = Pipeline(
    LogNode('extractor') | Filter('filter') | LogNode('loader')
)
pipe.consume(range(6))
```

This produces an output of

```
extractor processing 0
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
loader processing 4
extractor processing 5
loader processing 5
```

### Group By
Consecution provides a specialized class you can inherit from to perform grouping operations. GroupBy nodes must define two methods: `.key(item)` and `.process(batch)`. The `.key` method should return a key derived from an item that is used to identify groups. Any time that key changes, a new group is started. As with Python's `itertools.groupby`, you will usually want the GroupByNode to process sorted items. The `.process` method functions exactly like the `.process` method on regular nodes, except that instead of being called with single items, consecution will call it with a batch of items contained in a list.
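The key-change semantics just described mirror Python's `itertools.groupby`, which is also why sorted input usually matters. As a reference point only (this is plain standard-library code, not consecution), here is the same batching with key `item // 4`:

```python
from itertools import groupby

# Batch consecutive integers sharing the key item // 4, then sum each batch.
# groupby starts a new group every time the key value changes, which is
# exactly why unsorted input would fragment the groups.
items = range(16)
batch_sums = [
    sum(batch) for _, batch in groupby(items, key=lambda x: x // 4)
]
print(batch_sums)  # [6, 22, 38, 54]
```

This reproduces the sums that a `GroupByNode` with the same key pushes downstream in the next example.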
```python
from consecution import Node, GroupByNode, Pipeline


class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


class Batcher(GroupByNode):
    def key(self, item):
        return item // 4

    def process(self, batch):
        sum_val = sum(batch)
        self.push(sum_val)


pipe = Pipeline(
    Batcher('batcher') | LogNode('logger')
)
pipe.consume(range(16))
```

This produces an output of

```
logger processing 6
logger processing 22
logger processing 38
logger processing 54
```

### Plugin-Style Composition
Consecution forces you to think about problems in terms of how small processing units are connected. This separation between logic and connectivity can be exploited to create flexible and reusable solutions. Basically, you specify the connectivity you want to use in solving your problem, and then plug in the processing units later. Breaking the problem up in this way allows you to swap out processing units to achieve different objectives with the same pipeline.

```python
# This function defines a pipeline that can use swappable processing nodes.
# We don't worry about how we are going to do logging or aggregating.
# We just focus on how the nodes are connected.
def pipeline_factory(log_node, agg_node):
    pipe = Pipeline(
        log_node('extractor') | agg_node('aggregator') | log_node('result_logger')
    )
    return pipe


# Now we define a node for left-justified logging
class LeftLogNode(Node):
    def process(self, item):
        print('{: <15} processing {}'.format(self.name, item))
        self.push(item)


# And one for right-justified logging
class RightLogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)


# We can aggregate by summing
class SumNode(Node):
    def begin(self):
        self.result = 0

    def process(self, item):
        self.result += item

    def end(self):
        self.push(self.result)


# Or we can aggregate by multiplying
class ProdNode(Node):
    def begin(self):
        self.result = 1

    def process(self, item):
        self.result *= item

    def end(self):
        self.push(self.result)


# Now we plug in nodes to create a pipeline that left-prints sums
sum_pipeline = pipeline_factory(log_node=LeftLogNode, agg_node=SumNode)

# And a different pipeline that right-prints products
prod_pipeline = pipeline_factory(log_node=RightLogNode, agg_node=ProdNode)

print('aggregate with sum, left justified\n' + '-' * 40)
sum_pipeline.consume(range(1, 5))

print('\naggregate with product, right justified\n' + '-' * 40)
prod_pipeline.consume(range(1, 5))
```

This produces the following output

```
aggregate with sum, left justified
----------------------------------------
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
result_logger processing 10

aggregate with product, right justified
----------------------------------------
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
result_logger processing 24
```

# Aggregation Example
We end with a full-blown example of using a pipeline to aggregate data from a csv file. The data is contained in a csv file that looks like this.
gender | age | spent
--- | --- | ---
male | 11 | 39.39
female | 10 | 34.72
female | 15 | 40.02
male | 19 | 26.27
male | 13 | 21.22
female | 40 | 23.17
female | 52 | 33.42
male | 33 | 39.52
female | 16 | 28.65
male | 60 | 26.74

Although there are much simpler ways of solving this problem (e.g. with Pandashells), we deliberately construct a complex topology just to illustrate how to achieve complexity when it is actually needed. The diagram below was produced from the code beneath it. A quick glance at the diagram makes it obvious how the data is being routed through the system. The code is heavily commented to explain features of the consecution toolkit.

![Output Image](/images/gender_age.png?raw=true "Gender Age Pipeline")

```python
from __future__ import print_function
from collections import namedtuple
from pprint import pprint
import csv

from consecution import Node, Pipeline, GlobalState

# Named tuples are nice immutable containers
# for passing data between nodes
Person = namedtuple('Person', 'gender age spent')


# Create a pipeline that aggregates by gender and age.
# In creating the pipeline we focus on connectivity and don't
# worry about defining node behavior.
def pipe_factory(Extractor, Agg, gender_router, age_router):
    # Consecution provides a generic GlobalState class. Any object can be
    # used as the global_state in a pipeline, but the GlobalState object
    # provides a nice abstraction where attributes can be accessed either by
    # dot notation (e.g. global_state.my_attribute) or by dictionary notation
    # (e.g. global_state['my_attribute']). Furthermore, GlobalState objects
    # can be instantiated with initialized attributes using keyword arguments
    # as shown here.
    global_state = GlobalState(segment_totals={})

    # Notice, we haven't even defined the behavior of these nodes yet. They
    # will be defined later and are, for now, just passed into the factory
    # function as arguments while we focus on getting the topology right.
    pipe = Pipeline(
        Extractor('make_person') | [
            gender_router,
            (Agg('male') | [age_router, Agg('male_child'), Agg('male_adult')]),
            (Agg('female') | [age_router, Agg('female_child'), Agg('female_adult')]),
        ],
        global_state=global_state
    )

    # Nodes can be created outside of a pipeline definition
    adult = Agg('adult')
    child = Agg('child')
    total = Agg('total')

    # Sometimes the topology you want to create cannot easily be expressed
    # using the pipeline abstraction for wiring nodes together. You can
    # drop down to a lower level of abstraction by explicitly wiring nodes
    # together using the .add_downstream() method.
    adult.add_downstream(total)
    child.add_downstream(total)

    # Once a pipeline has been created, you can access individual nodes
    # with dictionary-like indexing on the pipeline.
    pipe['male_child'].add_downstream(child)
    pipe['female_child'].add_downstream(child)
    pipe['male_adult'].add_downstream(adult)
    pipe['female_adult'].add_downstream(adult)

    return pipe


# Now that we have the topology of our pipeline defined, we can think about
# the logic that needs to go into each node. We start by defining a node that
# takes a row from a csv file and transforms it into a namedtuple.
class MakePerson(Node):
    def process(self, item):
        item['age'] = int(item['age'])
        item['spent'] = float(item['spent'])
        self.push(Person(**item))


# We now define a node to perform our aggregations. Mutable global state
# comes with a lot of baggage and should be used with care. This node
# illustrates how to use global state to put all aggregations in a central
# location that remains accessible when the pipeline finishes processing.
class Sum(Node):
    def begin(self):
        # initialize the node-local sum to zero
        self.total = 0

    def process(self, item):
        # increment the node-local total and push the item downstream
        self.total += item.spent
        self.push(item)

    def end(self):
        # when the pipeline is done, update global state with the sum
        self.global_state.segment_totals[self.name] = round(self.total, 2)


# This function routes tuples based on their associated gender
def by_gender(item):
    return '{}'.format(item.gender)


# This function routes tuples based on whether the purchaser was an adult
# or a child
def by_age(item):
    if item.age >= 18:
        return '{}_adult'.format(item.gender)
    else:
        return '{}_child'.format(item.gender)


# Here we plug our node definitions into our topology to create a
# fully-defined pipeline.
pipe = pipe_factory(MakePerson, Sum, by_gender, by_age)

# We can now visualize the pipeline.
pipe.plot()

# Now we feed our pipeline with rows from the csv file
with open('sample_data.csv') as f:
    pipe.consume(csv.DictReader(f))

# The global_state is also available as an attribute on the pipeline,
# allowing us to access it when the pipeline is finished. This is a good way
# to "return" an object from a pipeline. Here we simply print the result.
print()
pprint(pipe.global_state.segment_totals)
```

And this is the result of running the pipeline with the sample csv file.

```
{'adult': 149.12,
 'child': 164.0,
 'female': 159.98,
 'female_adult': 56.59,
 'female_child': 103.39,
 'male': 153.14,
 'male_adult': 92.53,
 'male_child': 60.61,
 'total': 313.12}
```

As illustrated in the Pandashells example, this aggregation is actually much simpler to implement in Pandas. However, there are a couple of important caveats. The Pandas solution must load the entire csv file into memory at once. If you look at the pipeline solution, you will notice that each node simply increments its local sum and passes the data downstream. At no point is the data completely loaded into memory.
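The constant-memory behavior can be sketched independently of consecution: a running sum over a csv reader touches one row at a time. The snippet below uses a small in-memory string as a stand-in for `sample_data.csv` (the two rows shown are the first two from the table above):

```python
import csv
import io

# Stand-in for sample_data.csv; a real file object works the same way
data = io.StringIO("gender,age,spent\nmale,11,39.39\nfemale,10,34.72\n")

# csv.DictReader yields one row at a time, so only the running total is
# ever held in memory, no matter how large the file is.
total = 0.0
for row in csv.DictReader(data):
    total += float(row['spent'])

print(round(total, 2))  # 74.11
```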
Although the Pandas code runs much faster due to the highly optimized vectorized math it employs, the pipeline solution can process arbitrarily large csv files with a very small memory footprint.

Perhaps the most exciting aspect of consecution is its ability to create repeatable and testable data analysis pipelines. Passing Pandas DataFrames through a consecution pipeline makes it very easy to encapsulate any analysis into a well-defined, repeatable process where each node manipulates a dataframe in its prescribed way. Adopting this structure in analysis projects will undoubtedly ease the transition from analysis/research into production.

___
Projects by [robdmc](https://www.linkedin.com/in/robdecarvalho).
* [Pandashells](https://github.com/robdmc/pandashells) Pandas at the bash command line
* [Consecution](https://github.com/robdmc/consecution) Pipeline abstraction for Python
* [Behold](https://github.com/robdmc/behold) Helping debug large Python projects
* [Crontabs](https://github.com/robdmc/crontabs) Simple scheduling library for Python scripts
* [Switchenv](https://github.com/robdmc/switchenv) Manager for bash environments
* [Gistfinder](https://github.com/robdmc/gistfinder) Fuzzy-search your gists

================================================
FILE: consecution/__init__.py
================================================
# flake8: noqa
from consecution.nodes import Node, GroupByNode
from consecution.pipeline import Pipeline, GlobalState
from consecution.utils import Clock

__version__ = '0.2.0'

================================================
FILE: consecution/nodes.py
================================================
import sys
from collections import Counter, deque, OrderedDict
import traceback

from consecution.utils import Clock


class Node(object):
    """
    :type name: str
    :param name: The name of this node. Must be unique within a pipeline.

    :type kwargs: keyword args
    :param kwargs: Any additional keyword args are assigned as attributes on
        the node.
    You create nodes by inheriting from this class. You will be required to
    implement a `.process()` method on your class. You can call the `.push()`
    method from anywhere in your class implementation except from within the
    `.begin()` method.

    Note that although this documentation refers to "the `.push` method",
    `push` is actually a callable attribute assigned when nodes are placed
    into pipelines. Its signature is `.push(item)`, where `item` can be
    anything you want pushed to nodes connected to the downstream side of
    the node.
    """
    def __init__(self, name, **kwargs):
        # assign any user-defined attributes
        for k, v in kwargs.items():
            setattr(self, k, v)

        self.name = name
        self._upstream_nodes = []
        self._downstream_nodes = []
        self._num_top_down_calls = 0

        # node network can be visualized with pydot. These hold args and
        # kwargs that will be used to add and connect this node in the graph
        # visualization
        self._pydot_node_kwargs = dict(name=self.name, shape='rectangle')
        self._pydot_edge_kwarg_list = []

        self._router = None

        # this will be one of three values: None, 'input', 'output'
        self._logging = None

        # add a clock to allow for timing
        self.clock = Clock()

    def __str__(self):
        return 'N({})'.format(self.name)

    def __repr__(self):
        return self.__str__()

    def __hash__(self):
        """
        define __hash__ method. dicts and sets will use this as key
        """
        return id(self)

    def __eq__(self, other):
        return self.__hash__() == other.__hash__()

    def __lt__(self, other):
        """
        I need this to be able to sort by name
        """
        return self.name < other.name

    def __getitem__(self, key):
        msg = (
            '\n\nYou cannot call __getitem__ on nodes. You tried to call\n'
            '{self} [{key}]\n'
            'which doesn\'t make sense. You probably meant\n'
            '{self} | [{key}]\n'
        ).format(self=self, key=key)
        raise ValueError(msg)

    def _get_flattened_list(self, obj):
        if isinstance(obj, Node):
            return [obj]
        elif hasattr(obj, '__iter__'):
            nodes = []
            for el in obj:
                if isinstance(el, Node):
                    nodes.append(el)
                elif hasattr(el, '__iter__'):
                    nodes.extend(self._get_flattened_list(el))
            return nodes
        else:
            msg = (
                'Don\'t know what to do with {}. It\'s not a node, and '
                'it\'s not iterable.'
            ).format(repr(obj))
            raise ValueError(msg)

    def _get_exposed_slots(self, obj, pointing):
        nodes = set()
        for node in self._get_flattened_list(obj):
            if pointing == 'left':
                nodes = nodes.union(node.initial_node_set)
            elif pointing == 'right':
                nodes = nodes.union(node.terminal_node_set)
            else:
                raise ValueError('pointing must be "left" or "right"')
        return nodes

    def _connect_lefts_to_rights(self, lefts, rights, router=None):
        slots_from_left = self._get_exposed_slots(lefts, pointing='right')
        slots_from_right = self._get_exposed_slots(rights, pointing='left')
        for left in slots_from_left:
            router_node = None
            if router:
                router_name = '{}.{}'.format(
                    left.name, self._get_object_name(router))
                end_point_map = {n.name: n for n in slots_from_right}
                router_node = _RouterNode(
                    router_name, end_point_map, router)
                left.add_downstream(router_node)
            for right in slots_from_right:
                if router_node:
                    router_node.add_downstream(right)
                else:
                    left.add_downstream(right)

    def _get_object_name(self, obj):
        class_name = obj.__class__.__name__
        if class_name == 'function':
            return obj.__name__
        else:
            return class_name

    def _get_router(self, obj):
        router = None
        if hasattr(obj, '__iter__'):
            routers = [el for el in obj if hasattr(el, '__call__')]
            router = routers[0] if routers else None
        return router

    def __or__(self, other):
        router = self._get_router(other)
        self._connect_lefts_to_rights(self, other, router)
        return self

    def __ror__(self, other):
        self._connect_lefts_to_rights(other, self)
        return self

    @property
    def top_node(self):
        """
        This attribute always holds the top-most node in the node graph.
        Consecution only allows one top node.
        """
        root_nodes = self.root_nodes
        if len(root_nodes) > 1:
            msg = 'You must remove one of the following input nodes {}'.format(
                root_nodes)
            raise ValueError(msg)
        else:
            return root_nodes.pop()

    @property
    def terminal_node_set(self):
        """
        This attribute holds a set of all bottom nodes in the node graph.
        """
        return {
            node for node in self.depth_first_walk('down')
            if len(node._downstream_nodes) == 0
        }

    @property
    def initial_node_set(self):
        """
        When piecing together fragments of a graph, you can temporarily have
        connected nodes with multiple "top-nodes." This method returns this
        set of nodes. Note that consecution can only make pipelines from
        graphs having a single top node.
        """
        self.depth_first_walk('up')
        return {
            node for node in self.depth_first_walk('up')
            if len(node._upstream_nodes) == 0
        }

    @property
    def root_nodes(self):
        """
        This attribute holds a list of all nodes that do not have any
        upstream nodes attached.
        """
        return [
            node for node in self.all_nodes
            if len(node._upstream_nodes) == 0
        ]

    @property
    def all_nodes(self):
        """
        This attribute contains a set of all nodes in the graph.
        """
        return self.depth_first_walk('both')

    def log(self, what):
        """
        Calling this method on a node will turn on its logging feature. This
        means that the node will print logged items to the console. You can
        choose whether to log the inputs or outputs of a node.

        :type what: str
        :param what: One of 'input' or 'output' indicating whether you want
            to log the input or output of this node.
        """
        allowed = ['input', 'output']
        if what not in allowed:
            raise ValueError(
                '\'what\' argument must be in {}'.format(allowed)
            )
        self._logging = what

    def _get_downstream_reps(self):
        if self._downstream_nodes:
            downstreams = sorted([n.name for n in self._downstream_nodes])
            if len(downstreams) == 1:
                downstreams = downstreams[0]
            template = '{{: >{}s}} | {{}}\n'.format(
                self.pipeline._longest_node_name_len_)
            self.pipeline._node_repr += template.format(
                self.name, downstreams).replace('\'', '')

    def top_down_make_repr(self):
        """
        You should never need to use this method. It iterates through the
        node graph in top-down order making a repr string for each node.
        """
        if not hasattr(self, 'pipeline'):
            raise ValueError(
                'top_down_make_repr can only be called for nodes in a '
                'pipeline')
        self.pipeline._longest_node_name_len_ = max(
            len(n.name) for n in self.all_nodes)
        self.pipeline._node_repr = ''
        self.top_node.top_down_call('_get_downstream_reps')

    def top_down_call(self, method_name):
        """
        This utility method traverses the graph in top-down order and invokes
        the named method on every node it encounters. It is used internally
        to make sure the `.begin()` and `.end()` methods are not called
        before their upstream counterparts.

        :type method_name: str
        :param method_name: The name of the method you would like to call in
            top-down order.
        """
        # record the number of upstreams this node has
        num_upstreams = len(self._upstream_nodes)

        # if this node isn't pulling from multiple upstreams, it's ready
        # to recurse to downstreams
        if num_upstreams <= 1:
            ready_for_downstreams = True

        # this node isn't ready to recurse to downstreams until the current
        # call would be the last required call.
        elif self._num_top_down_calls == num_upstreams - 1:
            ready_for_downstreams = True
        else:
            ready_for_downstreams = False

        # if ready to recurse, then call the method on self and recurse
        # downwards.
        if ready_for_downstreams:
            getattr(self, method_name)()
            for downstream in self._downstream_nodes:
                downstream.top_down_call(method_name)
            self._num_top_down_calls = 0
        else:
            self._num_top_down_calls += 1

    def depth_first_walk(self, direction='both', as_ordered_list=False):
        """
        This method walks the graph of connected nodes in depth-first order.
        It uses a stack to emulate recursion. See good explanation at
        https://jeremykun.com/2013/01/22/depth-and-breadth-first-search/

        :type direction: str
        :param direction: one of 'up', 'down' or 'both' specifying the
            direction to walk.

        :type as_ordered_list: Bool
        :param as_ordered_list: If set to true, returns the walked nodes as
            an ordered list instead of an unordered set.

        :rtype: list or set
        :return: An iterable of the discovered nodes.
        """
        return self.walk(
            direction=direction, how='depth_first',
            as_ordered_list=as_ordered_list)

    def breadth_first_walk(self, direction='both', as_ordered_list=False):
        """
        This method walks the graph of connected nodes in breadth-first
        order. It uses a queue to emulate recursion. See good explanation at
        https://jeremykun.com/2013/01/22/depth-and-breadth-first-search/

        :type direction: str
        :param direction: one of 'up', 'down' or 'both' specifying the
            direction to walk.

        :type as_ordered_list: Bool
        :param as_ordered_list: If set to true, returns the walked nodes as
            an ordered list instead of an unordered set.

        :rtype: list or set
        :return: An iterable of the discovered nodes.
        """
        return self.walk(
            direction=direction, how='breadth_first',
            as_ordered_list=as_ordered_list)

    def walk(
            self, direction='both', how='breadth_first',
            as_ordered_list=False):
        """
        This is the core algorithm for walking a graph in a specified order.
        It is used by the `breadth_first_walk` and `depth_first_walk`
        methods.

        :type how: str
        :param how: one of 'breadth_first' or 'depth_first'

        :type direction: str
        :param direction: one of 'up', 'down' or 'both' specifying the
            direction to walk.
:type as_ordered_list: Bool :param as_ordered_list: If set to true, returns the walked nodes as an ordered list instead of an unordered set. :rtype: list or set :return: An iterable of the discovered nodes. """ if how not in {'depth_first', 'breadth_first'}: raise ValueError( '\'how\' argument must be one of ' '[\'depth_first\', \'breadth_first\']' ) # What I really want is an ordered set, which doesn't exist. So I'm # using the keys of an ordered dict to get the functionality I want. # I have no need for the values in this dict, only the keys. visited_nodes = OrderedDict() # holds nodes that still need to be explored queue = deque([self]) # while I still have nodes that need exploring while len(queue) > 0: # get the next node to explore node = queue.pop() # if I've already seen this node, nothing to do, so go to next if node in visited_nodes: continue # Make sure I don't visit this node # again. I'm using an ordered dict to mimic an ordered set. # I have no need for the value, so set it to None visited_nodes[node] = None neighbor_dict = { 'up': node._upstream_nodes, 'down': node._downstream_nodes, 'both': node._upstream_nodes + node._downstream_nodes, } if direction not in neighbor_dict: raise ValueError( 'direction must be \'up\', \'down\' or \'both\'') neighbors = neighbor_dict[direction] # search all neighbors to this node for unvisited nodes for node in neighbors: # if you find unvisited node, add it to nodes needing visit if node not in visited_nodes: if how == 'breadth_first': queue.appendleft(node) else: queue.append(node) # should have hit all nodes in the graph at this point if as_ordered_list: return list(visited_nodes.keys()) else: return set(visited_nodes.keys()) def _check_for_dups(self): counter = Counter() for node in self.all_nodes: counter.update({node.name: 1}) dups = [name for (name, count) in counter.items() if count > 1] if dups: msg = ( '\n\nNode names must be unique. Duplicates {} found.'
).format(list(dups)) raise ValueError(msg) return def _check_for_cycles(self): self_and_upstreams = self.depth_first_walk('up') downstreams = self.depth_first_walk('down') - {self} common_nodes = self_and_upstreams.intersection(downstreams) if common_nodes: raise ValueError('\n\nYour graph is not acyclic. It has loops.') def _validate_node(self, other): # only nodes allowed to be connected if not isinstance(other, Node): raise ValueError('Trying to connect a non-node type') def add_downstream(self, other): """ You will probably use this method quite a bit. It is used to manually attach a downstream node. :type other: consecution.Node :param other: An instance of the node you want to attach """ self._validate_node(other) self._downstream_nodes.append(other) other._upstream_nodes.append(self) self._check_for_dups() if self.name == other.name: raise ValueError('{} can\'t be downstream to itself'.format(self)) self._check_for_cycles() self._pydot_edge_kwarg_list.append( dict(tail_name=self.name, head_name=other.name)) def remove_downstream(self, other): """ This method removes the given node from being attached as a downstream node. 
:type other: consecution.Node :param other: An instance of the node you want to remove """ # remove self from the other's upstreams other._upstream_nodes = [ n for n in other._upstream_nodes if n.name != self.name] # remove other from self's downstream nodes self._downstream_nodes = [ n for n in self._downstream_nodes if n.name != other.name] # remove this connection from the pydot kwargs list new_kwargs_list = [] for kwargs in self._pydot_edge_kwarg_list: if kwargs['head_name'] == other.name: continue new_kwargs_list.append(kwargs) self._pydot_edge_kwarg_list = new_kwargs_list def _build_pydot_graph(self): """ This private method builds a pydot graph """ # define kwargs lists for creating the visualization (these are closure vars for function below) node_kwargs_list, edge_kwargs_list = [], [] # define a function to map over all nodes to aggregate viz kwargs def collect_kwargs(node): node_kwargs_list.append(node._pydot_node_kwargs) edge_kwargs_list.extend(node._pydot_edge_kwarg_list) for node in self.all_nodes: collect_kwargs(node) # doing import inside method so that graphviz dependency is optional from graphviz import Digraph # create a pydot graph graph = Digraph(comment='pipeline') # create pydot nodes for every node connected to this one for node_kwargs in node_kwargs_list: graph.node(**node_kwargs) # create pydot edges between all nodes connected to this one for edge_kwargs in edge_kwargs_list: graph.edge(**edge_kwargs) return graph def plot( self, file_name='pipeline', kind='png'): """ This method draws a visualization of your processing graph. You must have graphviz installed on your system for it to work properly. (See install instructions.) If you are running consecution in a Jupyter notebook, you can display an inline visualization of a pipeline by simply making the pipeline be the final expression in a cell.
:type file_name: str :param file_name: The name of the image file to generate :type kind: str :param kind: The kind of file to generate (png, pdf) """ graph = self._build_pydot_graph() # define allowed formats for saving the graph visualization ALLOWED_KINDS = {'pdf', 'png'} if kind not in ALLOWED_KINDS: raise ValueError('Only the following kinds are supported: {}'.format(ALLOWED_KINDS)) # set the output format graph.format = kind file_name = file_name.replace('.{}'.format(kind), '') # write the output file try: graph.render(file_name) except RuntimeError: sys.stderr.write( '\n\n' '=========================================================\n' 'Problem executing GraphViz. Make sure you have it\n' 'properly installed.\n' 'http://www.graphviz.org/\n' 'If you are on a mac, you should be able to install it with\n' 'brew install graphviz.\n\n' 'If you are on ubuntu, you can install it with\n' 'apt-get install graphviz\n' '=========================================================\n' '\n\n' ) raise def process(self, item): """ :type item: object :param item: The item this node should process You must override this method with your own logic. """ raise NotImplementedError( ( 'Error in node named {}\n' 'You must define a .process(self, item) method on all nodes' ).format(repr(self.name)) ) def reset(self): """ User can override this to do whatever logic they want. """ def _logged_process(self, item): if self._logging == 'input': self._write_log(item) self.process(item) def _begin(self): try: self.begin() except AttributeError: e = sys.exc_info()[1] tb = sys.exc_info()[2] ( code_file, line_no, method_name, line_txt ) = traceback.extract_tb(tb)[-1] msg = str(e) + ( '\n\nError in .begin() method of \'{}\' node.\n' 'Are you trying to call .push() from inside the\n' '.begin() method? 
That is not allowed.\n\n' 'file: {}, line {}\n--> {}\n\n' ).format(self.name, code_file, line_no, line_txt) traceback.print_exc() raise AttributeError(msg) def begin(self): pass def end(self): pass def _write_log(self, item): sys.stdout.write('node_log,{},{},{}\n'.format(self._logging, self.name, item)) def _push(self, item): """ This is the default pusher. It pushes to all downstreams. """ if self._logging == 'output': self._write_log(item) # The _process attribute will be set to the appropriate callable # when initializing the pipeline. I do this because I want the # chaining to be as efficient as possible. If logging is not set, # I don't want to have to hit that logic every push, so I just # invoke a callable attribute at each process that has been set # to the appropriate callable. for downstream in self._downstream_nodes: downstream._process(item) class _RouterNode(Node): """ This node will route to downstreams. The router function needs to return the name of the destination node. """ def __init__(self, name, end_point_map, route_callable): super(_RouterNode, self).__init__(name) self._end_point_map = end_point_map self._pydot_node_kwargs = dict(name=self.name, shape='oval') self._route_callable = route_callable def process(self, item): """ Route the item to the downstream node whose name is returned by the router callable. """ node = self._end_point_map.get(self._route_callable(item), None) if node is None: raise ValueError( ( '\n\nRouter node {} encountered bad route path {}. Valid ' 'route paths are {}.' ).format( self.name, repr(self._route_callable(item)), [n.name for n in self._downstream_nodes] ) ) node._process(item) class GroupByNode(Node): def __init__(self, *args, **kwargs): super(GroupByNode, self).__init__(*args, **kwargs) self._batch_ = [] self._previous_key = '__no_previous_key__' def key(self, item): """ You must define this method.
:type item: object :param item: The item you are processing :rtype: hashable object :return: a hashable object that serves as a key for the grouping process """ raise NotImplementedError( 'You must define a .key(self, item) method on all ' 'GroupBy nodes.' ) def process(self, batch): """ You must define this method. :type batch: iterable :param batch: A batch of items having the same key """ raise NotImplementedError( 'You must define a .process(self, batch) method on all GroupBy ' 'nodes.' ) def _process_item(self, item): key = self.key(item) if key != self._previous_key: self._previous_key = key if len(self._batch_) > 0: self.process(self._batch_) self._batch_ = [item] else: self._batch_.append(item) def _end(self): self.process(self._batch_) self._batch_ = [] def __getattribute__(self, name): """ This should trap for the end() method calls and install pre hook. """ if name == 'end': def wrapper(): self._end() return super(GroupByNode, self).__getattribute__(name)() return wrapper else: return super(GroupByNode, self).__getattribute__(name) ================================================ FILE: consecution/pipeline.py ================================================ import sys from consecution.nodes import GroupByNode class GlobalState(object): """ GlobalState is a simple container class that sets its attributes from constructor kwargs. It supports both object and dictionary access to its attributes. So, for example, all of the following statements are supported. .. code-block:: python from consecution import GlobalState global_state = GlobalState(a=1, b=2) global_state['c'] = 2 a = global_state['a'] An object of this class will be created as the default ``.global_state`` attribute on a Pipeline if you do not explicitly provide a global_state argument to the constructor. """ # I'm using unconventional "_item_self_" name here to avoid # conflicts when kwargs actually contain a "self" arg.
def __init__(_item_self, **kwargs): for key, val in kwargs.items(): _item_self[key] = val def __str__(_item_self): quoted_keys = [ '\'{}\''.format(k) for k in sorted(vars(_item_self).keys())] att_string = ', '.join(quoted_keys) return 'GlobalState({})'.format(att_string) def __repr__(_item_self): return _item_self.__str__() def __setitem__(_item_self, key, value): setattr(_item_self, key, value) def __getitem__(_item_self, key): return getattr(_item_self, key) class Pipeline(object): """ :type node: Node :param node: Any node in a connected graph :type global_state: object :param global_state: Any python object you want to use for holding global state. Once Nodes have been wired together, they must be placed in a pipeline in order to process data. If you would like to perform pipeline-level set-up and tear-down logic, you can subclass from Pipeline and override the ``.begin()`` and ``.end()`` methods. """ def __init__(self, node, global_state=None): # get a reference to the top node of the connected nodes supplied. self.top_node = node.top_node # set the pipeline global state if global_state: self.global_state = global_state else: self.global_state = GlobalState() # initialize an empty lookup for nodes self._node_lookup = {} # initialize the pipeline self.initialize() def initialize(self, with_push=False): # define a flag to determine if the pipeline is "running" or not # it will only be true between when the .begin() is run and the # .end() method is run.
self._is_running = False self._needs_log_header = False # initialize each node for node in self.top_node.all_nodes: self.initialize_node(node, with_push) # build the pipeline repr by cycling through all the nodes self.top_node.top_down_make_repr() # print a logging header if any node is logging if self._needs_log_header: sys.stdout.write('node_log,what,node_name,item\n') def initialize_node(self, node, with_push=False): # give node reference to pipeline attributes node.pipeline = self node.global_state = self.global_state # make node available for lookup self._node_lookup[node.name] = node # set the _process callable to be either logged or unlogged # TODO: might want to change this logic so that groupby nodes # can be logged if isinstance(node, GroupByNode): node._process = node._process_item elif node._logging is None: node._process = node.process else: self._needs_log_header = True node._process = node._logged_process # for single downstreams with no logging, can short-circuit all logic # and directly wire up the downstream process() callable as the # push callable on this node short_it = len(node._downstream_nodes) == 1 short_it = short_it and node._downstream_nodes[0]._logging is None short_it = short_it and not isinstance( node._downstream_nodes[0], GroupByNode) # only initialize push if requested if with_push: if short_it and node._logging is None: node.push = node._downstream_nodes[0].process # logged or multiple downstreams require logic, so no short circuit else: node.push = node._push def __getitem__(self, name): node = self._node_lookup.get(name, None) if node is None: raise KeyError('No node named \'{}\''.format(name)) return node def __setitem__(self, name_to_replace, replacement_node): # make sure replacement node has proper name if name_to_replace != replacement_node.name: raise ValueError( 'Replacement node must have the same name.'
) # this will automatically raise an error if the name doesn't exist node_to_replace = self[name_to_replace] removals = [] additions = [] for upstream in node_to_replace._upstream_nodes: removals.append((upstream, node_to_replace)) additions.append((upstream, replacement_node)) # handle special case of upstream being a routing node if hasattr(upstream, '_end_point_map'): upstream._end_point_map[name_to_replace] = replacement_node for downstream in node_to_replace._downstream_nodes: removals.append((node_to_replace, downstream)) additions.append((replacement_node, downstream)) for upstream, downstream in removals: upstream.remove_downstream(downstream) for upstream, downstream in additions: upstream.add_downstream(downstream) # initialize the replacement node within the pipeline self.initialize_node(replacement_node) # if top node was replaced then make sure pipeline knows about it if replacement_node.name == self.top_node.name: self.top_node = replacement_node def __getattribute__(self, name): """ This should trap for the begin() and end() method calls and install pre/post hooks for when they are called either on the pipeline class or on any class derived from it. """ if name == 'begin': def wrapper(): super(Pipeline, self).__getattribute__(name)() self._begin() return wrapper elif name == 'end': def wrapper(): self._end() return super(Pipeline, self).__getattribute__(name)() return wrapper elif name == 'reset': def wrapper(): self._reset() return super(Pipeline, self).__getattribute__(name)() return wrapper else: return super(Pipeline, self).__getattribute__(name) def begin(self): """ Override this method to execute any logic you want to perform before setting up nodes. The ``.begin()`` method of all nodes will be called. """ def end(self): """ Override this method to execute any logic you want to perform after all nodes are done processing data. The ``.end()`` method of all nodes will be called.
""" def reset(self): """ Override this with any logic you'd like to perform for resetting the pipeline. The ``.reset()`` method of all nodes will be called. """ def _reset(self): self.top_node.top_down_call('reset') def _begin(self): self.top_node.top_down_call('_begin') self.initialize(with_push=True) self._is_running = True def _end(self): self.top_node.top_down_call('end') self._is_running = False def push(self, item): """ You can manually push items to your pipeline using this meethod. :type item: object :param item: Any object you would like the pipeline to process """ if not self._is_running: self.begin() self.top_node._process(item) def consume(self, iterable): """ The pipeline will process each item in the iterable. :type iterable: A Python Iterable :param iterable: An iterable of objects you would like to process """ self.begin() for item in iterable: self.top_node._process(item) return self.end() def plot(self, file_name='pipeline', kind='png'): """ Call this method to produce a visualization of your pipeline. The Graphviz library will be used to generate the image file. Note that pipelines are automatically visualized in IPython notebook when they are evaluated as the last expression in a cell. :type file_name: str :param file_name: The name of the image file to save :type kind: str :param kind: The type of image file to produce (png, pdf) """ self.top_node.plot(file_name, kind) return self def __str__(self): return ( '\nPipeline\n' '----------------------------------' '----------------------------------\n{}' '----------------------------------' '----------------------------------\n' ).format(self._node_repr) def __repr__(self): return self.__str__() # No good way to test this unless you know dot is installed. 
def _repr_svg_(self): # pragma: no cover return self.top_node._build_pydot_graph()._repr_svg_() ================================================ FILE: consecution/tests/__init__.py ================================================ ================================================ FILE: consecution/tests/nodes_tests.py ================================================ import os from collections import namedtuple import shutil import tempfile from unittest import TestCase import subprocess from mock import patch from consecution.nodes import Node def dot_installed(): p = subprocess.Popen( ['bash', '-c', 'which dot'], stdout=subprocess.PIPE) p.wait() result = p.stdout.read().decode("utf-8") return 'dot' in result class FakeDigraph(object): # pragma: no cover def __init__(self, *args, **kwargs): pass def node(self, *args, **kwargs): pass def edge(self, *args, **kwargs): pass def render(self, *args, **kwargs): raise RuntimeError('fake runtime error') class NodeUnitTests(TestCase): def test_bad_logging_args(self): n = Node('a') with self.assertRaises(ValueError): n.log('bad') def test_bad_top_down_make_repr_call(self): n = Node('a') with self.assertRaises(ValueError): n.top_down_make_repr() def test_args_as_atts(self): n = Node('my_node', silly_attribute='silly') self.assertEqual(n.silly_attribute, 'silly') def test_comparisons(self): a = Node('a') b = Node('b') self.assertTrue(a == a) self.assertFalse(a == b) self.assertTrue(a < b) self.assertFalse(b < a) def test_bad_flattening(self): a = Node('a') with self.assertRaises(ValueError): a | 7 @patch( 'consecution.nodes.Node._build_pydot_graph', lambda a: FakeDigraph()) def test_graphviz_not_installed(self): a = Node('a') b = Node('b') p = a | b with self.assertRaises(RuntimeError): p.plot() def test_no_getitem(self): a = Node('a') with self.assertRaises(ValueError): a['b'] def test_bad_slot_name(self): a = Node('a') b = Node('b') with self.assertRaises(ValueError): a._get_exposed_slots(b, 'bad_arg') class 
ExplicitWiringTests(TestCase): def setUp(self): self.temp_dir = tempfile.mkdtemp() def tearDown(self): shutil.rmtree(self.temp_dir) def do_wiring(self): self.do_explicit_wiring() def do_explicit_wiring(self): # define nodes a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') g = Node('g') h = Node('h') i = Node('i') j = Node('j') k = Node('k') l = Node('l') # noqa. okay to use l as var here m = Node('m') n = Node('n') # save a list of all nodes self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n] self.top_node = a # wire up the nodes a.add_downstream(b) a.add_downstream(c) c.add_downstream(d) c.add_downstream(e) e.add_downstream(f) e.add_downstream(g) e.add_downstream(h) e.add_downstream(i) f.add_downstream(j) g.add_downstream(j) h.add_downstream(j) i.add_downstream(j) d.add_downstream(k) j.add_downstream(k) b.add_downstream(l) k.add_downstream(l) l.add_downstream(m) l.add_downstream(n) # same network in graph notation # a | [ # b, # c | [ # d, # e | [f, g, h, i, my_router] | j # ] | k # ] | l [m, n] def do_graph_wiring(self): # define nodes a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') g = Node('g') h = Node('h') i = Node('i') j = Node('j') k = Node('k') l = Node('l') # noqa. 
okay to use l as var here m = Node('m') n = Node('n') # save a list of all nodes self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n] self.top_node = a a | [ # noqa b, c | [ d, e | [f, g, h, i] | j ] | k ] | l | [m, n] def test_connections(self): Conns = namedtuple('Conns', 'node upstreams downstreams') self.do_wiring() n = { node.name: Conns( node.name, {u.name for u in node._upstream_nodes}, {d.name for d in node._downstream_nodes} ) for node in self.node_list } self.assertEqual(n['a'].upstreams, set()) self.assertEqual(n['a'].downstreams, {'b', 'c'}) self.assertEqual(n['b'].upstreams, {'a'}) self.assertEqual(n['b'].downstreams, {'l'}) self.assertEqual(n['c'].upstreams, {'a'}) self.assertEqual(n['c'].downstreams, {'d', 'e'}) self.assertEqual(n['e'].upstreams, {'c'}) self.assertEqual(n['e'].downstreams, {'f', 'g', 'h', 'i'}) self.assertEqual(n['f'].upstreams, {'e'}) self.assertEqual(n['f'].downstreams, {'j'}) self.assertEqual(n['g'].upstreams, {'e'}) self.assertEqual(n['g'].downstreams, {'j'}) self.assertEqual(n['h'].upstreams, {'e'}) self.assertEqual(n['h'].downstreams, {'j'}) self.assertEqual(n['i'].upstreams, {'e'}) self.assertEqual(n['i'].downstreams, {'j'}) self.assertEqual(n['d'].upstreams, {'c'}) self.assertEqual(n['d'].downstreams, {'k'}) self.assertEqual(n['j'].upstreams, {'f', 'g', 'h', 'i'}) self.assertEqual(n['j'].downstreams, {'k'}) self.assertEqual(n['k'].upstreams, {'j', 'd'}) self.assertEqual(n['k'].downstreams, {'l'}) self.assertEqual(n['l'].upstreams, {'k', 'b'}) self.assertEqual(n['l'].downstreams, {'m', 'n'}) def test_all_nodes(self): self.do_wiring() expected_set = set(self.node_list) all_nodes_set = [ set(node.all_nodes) for node in self.node_list ] self.assertTrue(all( [expected_set == found_set for found_set in all_nodes_set])) def test_top_node(self): self.do_wiring() top_node_set = {node.top_node for node in self.node_list} self.assertEqual(top_node_set, {self.top_node}) def test_duplicate_node(self): self.do_wiring() # this test 
is funky in that it has assertion in a loop. # but I wanted to be sure cycles are detected everywhere for name in [n.name for n in self.top_node.all_nodes]: dup = Node(name) with self.assertRaises(ValueError): self.top_node.add_downstream(dup) def test_acyclic(self): self.do_wiring() # this test is funky in that it has assertion in a loop. # but I wanted to be sure dups are detected everywhere for node in self.top_node.all_nodes: with self.assertRaises(ValueError): node.add_downstream(self.top_node) def test_multi_root(self): self.do_wiring() other_root = Node('dual_root') other_root.add_downstream(self.top_node._downstream_nodes[0]) with self.assertRaises(ValueError): other_root.top_node def test_non_node_connect(self): node = Node('a') other = 'not a node' with self.assertRaises(ValueError): node.add_downstream(other) def test_write(self): # don't run coverage on this because won't test travis with # both dot installed and not installed. if dot_installed(): # pragma: no cover self.do_wiring() out_file = os.path.join(self.temp_dir, 'out.png') self.top_node.plot(out_file) # uncomment the next line if you want to look at the graph os.system('cp {} /tmp'.format(out_file)) def test_write_bad_kind(self): self.do_wiring() with self.assertRaises(ValueError): self.top_node.plot(kind='bad') def test_bad_search_direction(self): self.do_wiring() with self.assertRaises(ValueError): self.top_node.breadth_first_walk(direction='bad') def test_bad_search_method(self): self.do_wiring() with self.assertRaises(ValueError): self.top_node.walk(how='bad') class DSLWiringTests(ExplicitWiringTests): def do_wiring(self): self.do_graph_wiring() class TopDownCallTests(TestCase): def test_call_order_okay(self): # a toy class that holds a class variable # tracking what order objects get called in class MyNode(Node): call_list = [] def end(self): self.__class__.call_list.append(self) a = MyNode('a') b = MyNode('b') c = MyNode('c') d = MyNode('d') e = MyNode('e') f = MyNode('f') g = MyNode('g') 
a | [ b | c, d | e | f ] | g a.top_node.top_down_call('end') # make a dictionary with order in which nodes # were called call_number = { node: ind for (ind, node) in enumerate(a.__class__.call_list)} # make sure ording of one branch is right self.assertTrue(call_number[a] < call_number[b]) self.assertTrue(call_number[b] < call_number[c]) self.assertTrue(call_number[c] < call_number[g]) # make sure ordering of other branch is okay self.assertTrue(call_number[a] < call_number[d]) self.assertTrue(call_number[d] < call_number[e]) self.assertTrue(call_number[e] < call_number[f]) self.assertTrue(call_number[f] < call_number[g]) class BreadthFirstSearchTests(TestCase): def test_top_down_order(self): a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') h = Node('h') i = Node('i') def silly_router(item): # pragma: no cover return 0 a | [b, c] | [d, e, f, silly_router] | [h, i] nodes = a.top_node.breadth_first_walk( direction='down', as_ordered_list=True) level5 = {nodes.pop() for nn in range(2)} level4 = {nodes.pop() for nn in range(3)} level3 = {nodes.pop() for nn in range(2)} level2 = {nodes.pop() for nn in range(2)} level1 = {nodes.pop() for nn in range(1)} self.assertEqual(level1, {a}) self.assertEqual(level2, {b, c}) self.assertEqual(len(level3), 2) self.assertEqual(level4, {d, e, f}) self.assertEqual(level5, {h, i}) def test_bottom_up_order(self): a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') h = Node('h') def silly_router(item): # pragma: no cover return 0 a | [b, c] | [d, e, f, silly_router] | h nodes = h.breadth_first_walk(direction='up', as_ordered_list=True) nodes = nodes[::-1] level5 = {nodes.pop() for nn in range(1)} level4 = {nodes.pop() for nn in range(3)} level3 = {nodes.pop() for nn in range(2)} level2 = {nodes.pop() for nn in range(2)} level1 = {nodes.pop() for nn in range(1)} self.assertEqual(level1, {a}) self.assertEqual(level2, {b, c}) self.assertEqual(len(level3), 2) 
self.assertEqual(level4, {d, e, f}) self.assertEqual(level5, {h}) class PrintingTests(TestCase): def setUp(self): # define nodes a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') f = Node('f') g = Node('g') h = Node('h') i = Node('i') j = Node('j') k = Node('k') l = Node('l') # noqa okay to use l here m = Node('m') n = Node('n') class DummyPipeline(object): pass pipeline = DummyPipeline() # save a list of all nodes self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n] self.top_node = a def my_router(item): # pragma: no cover return 'm' # wire up nodes using dsl a | [ b, # noqa c | [ d, e | [f, g, h, i] | j ] | k ] | l | [m, n, my_router] for node in self.top_node.all_nodes: node.pipeline = pipeline def test_nothing(self): self.top_node.top_down_make_repr() lines = sorted([ line.strip() for line in self.top_node.pipeline._node_repr.split('\n') if line.strip() ]) expected_lines = sorted([ 'a | [b, c]', 'b | l', 'c | [d, e]', 'd | k', 'e | [f, g, h, i]', 'f | j', 'g | j', 'h | j', 'i | j', 'j | k', 'k | l', 'l | l.my_router', 'l.my_router | [m, n]', ]) self.assertEqual(lines, expected_lines) class RoutingTests(TestCase): def test_nothing(self): a = Node('a') b = Node('b') c = Node('c') d = Node('d') e = Node('e') def silly_router(item): # pragma: no cover return 0 class ClassRouter(object): # pragma: no cover def __call__(self, arg): return arg a | [b, c, ClassRouter()] | [d, e, silly_router] ================================================ FILE: consecution/tests/pipeline_tests.py ================================================ from __future__ import print_function from collections import namedtuple, Counter from unittest import TestCase from consecution.nodes import Node, GroupByNode from consecution.pipeline import Pipeline, GlobalState from consecution.tests.testing_helpers import print_catcher Item = namedtuple('Item', 'value parent source') class Item(object): # pragma: no cover (just a testing helper) def __init__(self, value, parent, 
source): self.value = value self.parent = parent self.source = source def build_source_list(self, source_list=None): source_list = [] if source_list is None else source_list source_list.append(self.source) if self.parent: self.parent.build_source_list(source_list) return source_list def get_path_string(self): return '|'.join([str(self.value)] + self.build_source_list()[::-1]) def __str__(self): return self.get_path_string() def __repr__(self): return self.get_path_string() class TestNode(Node): def process(self, item): self.push( Item(value=item.value, parent=item, source=self.name) ) class ResultNode(Node): def process(self, item): self.global_state.final_items.append(item) class BadNode(Node): def begin(self): self.push(1) def process(self, item): # pragma: no cover this should never get hit. self.push(item) def item_generator(): for ind in range(1, 3): yield Item( value=ind, parent=None, source='generator' ) class TestBase(TestCase): def setUp(self): a = TestNode('a') b = TestNode('b') c = TestNode('c') d = TestNode('d') even = TestNode('even') odd = TestNode('odd') g = TestNode('g') def even_odd(item): return ['even', 'odd'][item.value % 2] a | b | [c, d] | [even, odd, even_odd] | g self.pipeline = Pipeline(a, global_state=GlobalState(final_items=[])) class GlobalStateUnitTests(TestCase): def test_kwargs_passed(self): g = GlobalState(custom_name='custom') p = Pipeline(TestNode('a'), global_state=g) self.assertTrue(p.global_state.custom_name == 'custom') self.assertTrue(p.global_state['custom_name'] == 'custom') def test_printing(self): g = GlobalState(custom_name='custom') with print_catcher() as catcher1: print(g) with print_catcher() as catcher2: print(repr(g)) self.assertTrue( 'GlobalState(\'custom_name\')' in catcher1.txt) self.assertTrue( 'GlobalState(\'custom_name\')' in catcher2.txt) class OrOpTests(TestCase): def test_ror(self): a = Node('a') b = Node('b') c = Node('c') d = Node('d') p = Pipeline(a | ([b, c] | d)) with print_catcher() as catcher: 
print(p) self.assertTrue('a | [b, c]' in catcher.txt) self.assertTrue('c | d' in catcher.txt) self.assertTrue('b | d' in catcher.txt) class ManualFeedTests(TestCase): def test_manual_feed(self): class N(Node): def begin(self): self.global_state.out_list = [] def process(self, item): self.global_state.out_list.append(item) pipeline = Pipeline(TestNode('a') | N('b')) pushed_list = [] for item in item_generator(): pushed_list.append(item) pipeline.push(item) pipeline.end() self.assertEqual(len(pipeline.global_state.out_list), 2) class PipelineUnitTests(TestCase): def test_push_in_begin(self): pipeline = Pipeline(BadNode('a') | TestNode('b')) with self.assertRaises(AttributeError): pipeline.begin() def test_no_process(self): class N(Node): pass pipe = Pipeline(N('a') | N('b')) with self.assertRaises(NotImplementedError): pipe.consume(range(3)) def test_bad_route(self): def bad_router(item): return 'bad' class N(Node): def process(self, item): self.push(item) pipeline = Pipeline(N('a') | [N('b'), N('c'), bad_router]) with self.assertRaises(ValueError): pipeline.consume(range(3)) def test_bad_node_lookup(self): pipeline = Pipeline(TestNode('a') | TestNode('b')) with self.assertRaises(KeyError): pipeline['c'] def test_bad_replacement_name(self): pipeline = Pipeline(TestNode('a') | TestNode('b')) with self.assertRaises(ValueError): pipeline['b'] = TestNode('c') def test_flattened_list(self): pipeline = Pipeline( TestNode('a') | [[Node('b'), Node('c')]]) with print_catcher() as catcher: print(pipeline) self.assertTrue('a | [b, c]' in catcher.txt) def test_logging(self): pipeline = Pipeline(TestNode('a') | TestNode('b')) pipeline['a'].log('output') pipeline['b'].log('input') with print_catcher() as catcher: pipeline.consume(item_generator()) text = """ node_log,what,node_name,item node_log,output,a,1|generator|a node_log,input,b,1|generator|a node_log,output,a,2|generator|a node_log,input,b,2|generator|a """ for line in text.split('\n'): self.assertTrue(line.strip() in 
catcher.txt) def test_reset(self): class N(Node): def begin(self): self.was_reset = False def process(self, item): self.push(item) def reset(self): self.was_reset = True pipe = Pipeline(N('a') | N('b')) pipe.consume(range(3)) self.assertFalse(pipe['a'].was_reset) self.assertFalse(pipe['b'].was_reset) pipe.reset() self.assertTrue(pipe['a'].was_reset) self.assertTrue(pipe['b'].was_reset) class LoggingTests(TestBase): def test_logging(self): self.pipeline['g'].log('input') with print_catcher() as printer: self.pipeline.consume(item_generator()) counter = Counter() for line in printer.lines(): even_odd = line.split('|')[-1] counter.update({even_odd: 1}) self.assertEqual(counter['even'], 2) self.assertEqual(counter['odd'], 2) class ReplacementTests(TestBase): def test_replace_first(self): class Replacement(Node): def process(self, item): self.push( Item(value=10 * item.value, parent=item, source=self.name) ) self.pipeline['a'] = Replacement('a') self.pipeline['a'].log('output') with print_catcher() as printer: self.pipeline.consume(item_generator()) self.assertEqual(printer.txt.count('10'), 1) self.assertEqual(printer.txt.count('20'), 1) def test_replace_even(self): class Replacement(Node): def process(self, item): self.push( Item(value=10 * item.value, parent=item, source=self.name) ) self.pipeline['even'] = Replacement('even') self.pipeline['g'].log('output') with print_catcher() as printer: self.pipeline.consume(item_generator()) self.assertEqual(printer.txt.count('1'), 2) self.assertEqual(printer.txt.count('20'), 2) def test_replace_no_router(self): a = TestNode('a') b = TestNode('b') pipe = Pipeline(a | b) pipe['b'] = TestNode('b') with print_catcher() as catcher: print(pipe) self.assertTrue('a | b' in catcher.txt) class ConsumingTests(TestBase): def test_even_odd(self): self.pipeline['g'].add_downstream( ResultNode('result_node') ) self.pipeline.consume(item_generator()) expected_path_set = set([ '1|generator|a|b|c|odd|g', '1|generator|a|b|d|odd|g', 
'2|generator|a|b|c|even|g', '2|generator|a|b|d|even|g', ]) path_set = set( item.get_path_string() for item in self.pipeline.global_state.final_items ) self.assertEqual(expected_path_set, path_set) class ConstructingTests(TestBase): def test_printing(self): lines = repr(self.pipeline).split('\n') self.assertEqual(len(lines), 13) def test_plotting(self): # don't want to force a mock dependency, so make a simple mock here args_kwargs = [] def return_calls(*args, **kwargs): args_kwargs.append(args) args_kwargs.append(kwargs) # assign my mock to the top node plot function self.pipeline.top_node.plot = return_calls # call pipeline plot self.pipeline.plot() # make sure top node plot was properly called self.assertEqual(args_kwargs[0], ('pipeline', 'png')) self.assertEqual(args_kwargs[1], {}) class Batch(GroupByNode): def begin(self): self.global_state.batches = [] def key(self, item): return item // 3 def process(self, batch): self.global_state.batches.append(batch) class GroupByTests(TestCase): def test_batching(self): pipe = Pipeline(Batch('a')) pipe.consume(range(9)) self.assertEqual( pipe.global_state.batches, [[0, 1, 2], [3, 4, 5], [6, 7, 8]] ) def test_undefined_key(self): class B(GroupByNode): def process(self, item): # pragma: no cover pass pipe = Pipeline(B('a')) with self.assertRaises(NotImplementedError): pipe.consume(range(9)) def test_undefined_process(self): class B(GroupByNode): def key(self, item): pass pipe = Pipeline(B('a')) with self.assertRaises(NotImplementedError): pipe.consume(range(9)) ================================================ FILE: consecution/tests/testing_helpers.py ================================================ import sys from contextlib import contextmanager # These don't need to be covered.
They are just testing utilities @contextmanager def print_catcher(buff='stdout'): # pragma: no cover if buff == 'stdout': sys.stdout = Printer() yield sys.stdout sys.stdout = sys.__stdout__ elif buff == 'stderr': sys.stderr = Printer() yield sys.stderr sys.stderr = sys.__stderr__ else: # pragma: no cover This is just to help testing. No need to cover. raise ValueError('buff must be either \'stdout\' or \'stderr\'') class Printer(object): # pragma: no cover def __init__(self): self.txt = "" def write(self, txt): self.txt += txt def lines(self): for line in self.txt.split('\n'): yield line.strip() ================================================ FILE: consecution/tests/utils_tests.py ================================================ from __future__ import print_function from unittest import TestCase from consecution.utils import Clock import time from consecution.tests.testing_helpers import print_catcher class ClockTests(TestCase): def test_bad_start(self): clock = Clock() with self.assertRaises(ValueError): clock.start() def test_printing(self): clock = Clock() with clock.running('a', 'b', 'c'): with clock.paused('a'): time.sleep(.1) with clock.paused('b'): time.sleep(.1) with print_catcher() as printer: print(repr(clock)) names = [] for ind, line in enumerate(printer.txt.split('\n')): if line: if ind > 0: names.append(line.split()[-1]) self.assertEqual(names, ['c', 'b', 'a']) def test_get_time_of_running(self): clock = Clock() with clock.running('a'): time.sleep(.1) delta1 = int(10 * clock.get_time()) time.sleep(.1) delta2 = int(10 * clock.get_time()) self.assertEqual(delta1, 1) self.assertEqual(delta2, 2) def test_pausing(self): clock = Clock() with clock.running('a', 'b', 'c'): time.sleep(.1) with clock.paused('b', 'c'): time.sleep(.1) self.assertEqual(int(10 * clock.get_time('a')), 2) self.assertEqual(int(10 * clock.get_time('b')), 1) self.assertEqual(int(10 * clock.get_time('c')), 1) self.assertEqual( {int(10 * v) for v in clock.get_time().values()}, {1, 2} )
def test_stop_all(self): clock = Clock() clock.start('a', 'b') time.sleep(.1) clock.stop() self.assertEqual(int(10 * clock.get_time('a')), 1) self.assertEqual(int(10 * clock.get_time('b')), 1) def test_reset_all(self): clock = Clock() clock.start('a', 'b') time.sleep(.1) clock.stop('b') self.assertEqual(len(clock.delta), 1) clock.reset() self.assertEqual(len(clock.get_time()), 0) def test_double_calls(self): clock = Clock() clock.start('a') clock.start('a') time.sleep(.1) clock.stop('a') clock.stop('a') self.assertEqual(int(round(10 * clock.get_time())), 1) clock.reset('a') clock.reset('a') clock.reset('b') clock.reset('b') self.assertEqual(clock.get_time(), {}) def test_get_time_delta_only(self): clock = Clock() clock.start('a') clock.stop('a') self.assertEqual(clock.get_time('f'), {}) ================================================ FILE: consecution/utils.py ================================================ from collections import Counter from contextlib import contextmanager import datetime class Clock(object): def __init__(self): # see the reset method for instance attributes self.delta = Counter() self.active_start_times = dict() @contextmanager def running(self, *names): self.start(*names) yield self.stop(*names) @contextmanager def paused(self, *names): self.stop(*names) yield self.start(*names) def start(self, *names): if not names: raise ValueError('You must provide at least one name to start') for name in names: if name not in self.active_start_times: self.active_start_times[name] = datetime.datetime.now() def stop(self, *names): ending = datetime.datetime.now() if not names: names = list(self.active_start_times.keys()) for name in names: if name in self.active_start_times: starting = self.active_start_times.pop(name) self.delta.update({name: (ending - starting).total_seconds()}) def reset(self, *names): if not names: names = list(self.active_start_times.keys()) names.extend(list(self.delta.keys())) for name in names: if name in self.delta: 
self.delta.pop(name) if name in self.active_start_times: self.active_start_times.pop(name) def get_time(self, *names): ending = datetime.datetime.now() if not names: names = list(self.delta.keys()) names.extend(list(self.active_start_times.keys())) delta = Counter() for name in names: if name in self.delta: delta.update({name: self.delta[name]}) elif name in self.active_start_times: delta.update( { name: ( ending - self.active_start_times[name] ).total_seconds() } ) if len(delta) == 1: return delta[list(delta.keys())[0]] else: return dict(delta) def __str__(self): records = sorted(self.delta.items(), key=lambda t: t[1], reverse=True) records = [('%0.6f' % r[1], r[0]) for r in records] out_list = ['{: <15s}{}'.format('seconds', 'name')] for rec in records: out_list.append('{: <15s}{}'.format(*rec)) return '\n'.join(out_list) def __repr__(self): return self.__str__() ================================================ FILE: docker/Dockerfile ================================================ FROM ubuntu:xenial # root is the home directory WORKDIR /root ADD simple_example.py /root/simple_example.py # set up the system tools including conda RUN \ rm /bin/sh && ln -s /bin/bash /bin/sh && \ apt-get update && \ apt-get install -y vim && \ apt-get install -y git && \ apt-get install -y wget && \ apt-get install -y curl && \ apt-get install -y graphviz && \ apt-get install -y python-dev RUN \ curl -sS https://bootstrap.pypa.io/get-pip.py | python RUN \ pip install git+https://github.com/robdmc/consecution.git ================================================ FILE: docker/docker_build.sh ================================================ #! /usr/bin/env bash docker build . -t consecution ================================================ FILE: docker/docker_run.sh ================================================ #! 
/usr/bin/env bash docker run -it --rm -v $(pwd):/root/shared consecution /bin/bash ================================================ FILE: docker/simple_example.py ================================================ #! /usr/bin/env python # TODO: make the consecution install in the docker file read from pip from __future__ import print_function from consecution import Node, Pipeline class N(Node): def process(self, item): print(item, self.name) self.push(item) p = Pipeline( N('a') | [N('b'), N('c')] | N('d') ) p.plot() p.consume(range(5)) ================================================ FILE: docs/Makefile ================================================ # Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # User-friendly check for sphinx-build ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) endif # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp epub latex latexpdf text man changes linkcheck doctest gettext help: @echo "Please use \`make <target>' where <target> is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " xml to make Docutils-native XML files" @echo " pseudoxml to make pseudoxml-XML files for display purposes" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -W -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files."
htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." latexpdfja: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through platex and dvipdfmx..." $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 
changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." xml: $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml @echo @echo "Build finished. The XML files are in $(BUILDDIR)/xml." pseudoxml: $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml @echo @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." ================================================ FILE: docs/conf.py ================================================ # -*- coding: utf-8 -*- # import inspect import os import re def get_version(): """Obtain the package version from a python file e.g. pkg/__init__.py See . """ file_dir = os.path.realpath(os.path.dirname(__file__)) with open( os.path.join(file_dir, '..', 'consecution', '__init__.py')) as f: txt = f.read() version_match = re.search( r"""^__version__ = ['"]([^'"]*)['"]""", txt, re.M) if version_match: return version_match.group(1) raise RuntimeError("Unable to find version string.") # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. #sys.path.insert(0, os.path.abspath('.')) # -- General configuration ------------------------------------------------ extensions = [ 'sphinx.ext.autodoc', 'sphinx.ext.intersphinx', #'sphinx.ext.viewcode', ] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames.
source_suffix = '.rst' # The master toctree document. master_doc = 'toc' # General information about the project. project = 'consecution' copyright = '2017, Rob deCarvalho' # The short X.Y version. version = get_version() # The full version, including alpha/beta/rc tags. release = version exclude_patterns = ['_build'] # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' intersphinx_mapping = { 'python': ('http://docs.python.org/3.4', None), 'django': ('http://django.readthedocs.org/en/latest/', None), #'celery': ('http://celery.readthedocs.org/en/latest/', None), } # -- Options for HTML output ---------------------------------------------- html_theme = 'default' #html_theme_path = [] on_rtd = os.environ.get('READTHEDOCS', None) == 'True' if not on_rtd: # only import and set the theme if we're building docs locally import sphinx_rtd_theme html_theme = 'sphinx_rtd_theme' html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". # html_static_path = ['_static'] # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. html_show_sphinx = False # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. html_show_copyright = True # Output file base name for HTML help builder. htmlhelp_basename = 'consecutiondoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). 
#'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ ('index', 'consecution.tex', 'consecution Documentation', 'Rob deCarvalho', 'manual'), ] # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ ('index', 'consecution', 'consecution Documentation', ['Rob deCarvalho'], 1) ] # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ('index', 'consecution', 'consecution Documentation', 'Rob deCarvalho', 'consecution', 'A short description', 'Miscellaneous'), ] def process_django_model_docstring(app, what, name, obj, options, lines): """ Does special processing for django model docstrings, making docs for fields in the model. 
""" # This causes import errors if left outside the function from django.db import models # Only look at objects that inherit from Django's base model class if inspect.isclass(obj) and issubclass(obj, models.Model): # Grab the field list from the meta class fields = obj._meta.fields for field in fields: # Decode and strip any html out of the field's help text help_text = strip_tags(force_unicode(field.help_text)) # Decode and capitalize the verbose name, for use if there isn't # any help text verbose_name = force_unicode(field.verbose_name).capitalize() if help_text: # Add the model field to the end of the docstring as a param # using the help text as the description lines.append(':param %s: %s' % (field.attname, help_text)) else: # Add the model field to the end of the docstring as a param # using the verbose name as the description lines.append(':param %s: %s' % (field.attname, verbose_name)) # Add the field's type to the docstring lines.append(':type %s: %s' % (field.attname, type(field).__name__)) # Return the extended docstring return lines def setup(app): # Register the docstring processor with sphinx app.connect('autodoc-process-docstring', process_django_model_docstring) ================================================ FILE: docs/index.rst ================================================ Overview ============================= Consecution is: * An easy-to-use pipeline abstraction inspired by `Apache Storm Topologies `_. * Designed to simplify building ETL pipelines that are robust and easy to test * A system for wiring together simple processing nodes to form a DAG, which is fed with a python iterable * Built using synchronous, single-threaded execution strategies designed to run efficiently on a single core * Implemented in pure-python with optional requirements that are needed only for graph visualization * Written with 100% test coverage See the `Github project page `_. for examples of how to use `consecution`. 
================================================ FILE: docs/ref/consecution.rst ================================================ .. _ref-consecution: API documentation ================== Node ---- Nodes are the fundamental processing unit in consecution. A node is created by inheriting from the `consecution.Node` class. You are free to declare as many attributes and methods on a node class as you wish. You should not override the constructor unless you really know what you're doing. Instead, any initialization you wish to perform can be carried out in the `.begin()` method. In the descriptions below, it is assumed that the nodes being discussed have been wired together into a pipeline and are ready to consume items. See the `Github README `_ for examples of how to wire nodes into pipelines. Reserved Method Names ~~~~~~~~~~~~~~~~~~~~~ The following Node methods are not intended to be overridden, so you should not define methods with these names in your node implementations unless you really know what you are doing. * `top_node` * `initial_node_set` * `terminal_node_set` * `root_nodes` * `all_nodes` * `log` * `top_down_make_repr` * `top_down_call` * `depth_first_search` * `breadth_first_search` * `search` * `add_downstream` * `remove_downstream` * `plot` There are also a number of private method names you should avoid. These can be identified by looking at the `source code `_ Examples ~~~~~~~~ Here is the simplest possible node you could construct: .. code-block:: python from consecution import Node class MyNode(Node): def process(self, item): self.push(item) All nodes acquire a `.push()` method when they are wired into a pipeline. You can call this method anywhere in your class except in the `.begin()` method. The `.push(item)` method will take its argument and send it to the `.process()` methods of the nodes that are immediately downstream in your pipeline graph. Here is an example node defining all methods you can override. 
The functionality of each method is explained in the code comments. .. code-block:: python from consecution import Node class MyNode(Node): def begin(self): # This sets up whatever state you want to exist before the # node begins processing any data. You can think of it as an # init method that runs just before the node starts processing. # In this example, we initialize a simple counter self.counter = 0 def process(self, item): # This is the method that defines the processing you want to perform # on every item the node processes. You can place whatever logic # you want here, including calls to the .push() method. # In this example, we update the counter and push the item # downstream. self.counter += 1 self.push(item) def end(self): # This method is called right after all items are processed. # This happens when the iterator being consumed by the pipeline # is exhausted. At that point the .end() methods of all nodes # in the pipeline are called. This is a good place for you to # push any summary information downstream. # In this example we push the results of our counter self.push(self.counter) def reset(self): # A pipeline can be reused and reset back to its initial condition. # It does this by calling the .reset() method of all its member # nodes. You can place whatever code you want here to reset your # node to its initial state. # In this example, we simply reset the counter. self.counter = 0 Node API Documentation ~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: consecution.nodes.Node :members: GroupBy Node ~~~~~~~~~~~~~~~~~~~~~~ Consecution provides a special Node class specifically designed to do grouping. It works in much the same way as Python's built-in ``itertools.groupby`` function. It expects to process items in key-sorted order. In addition to the ``.process()`` method required of all nodes, you must also define a ``.key()`` method that will extract a key from each item being processed. See the Github project page for an example of using ``GroupByNode``. ..
autoclass:: consecution.nodes.GroupByNode :members: Manually Connecting Nodes ------------------------- The Node base class is equipped with an ``.add_downstream(other_node)`` method. This method provides detailed control over how nodes are wired together. It simply adds ``other_node`` as a downstream relation. Here is an example of creating a pipeline with one top node that broadcasts items to two downstream nodes, and then collects their results into a single output node. .. code-block:: python from __future__ import print_function from consecution import Pipeline, Node class SimpleNode(Node): def process(self, item): print('{} processing {}'.format(self.name, item)) self.push(item) top = SimpleNode('top') left = SimpleNode('left') right = SimpleNode('right') output = SimpleNode('output') top.add_downstream(left) top.add_downstream(right) left.add_downstream(output) right.add_downstream(output) pipe = Pipeline(top) pipe.consume(range(2)) Node Connection Mini-language ----------------------------- Consecution provides a concise domain-specific language (DSL) for creating directed acyclic graphs. This is the preferred method for connecting nodes into a pipeline. However, you may occasionally find that your desired topology is not easy to express in the DSL. For these situations, consecution provides a lower-level escape hatch that allows you to manually connect two nodes together. These two levels of abstraction provide a very powerful interface for constructing complex pipelines. The DSL is inspired by the unix syntax for chaining together the inputs and outputs of different programs at the bash prompt. You use the pipe symbol ``|`` to connect nodes together. These pipe operators will always return one of the node objects in your connected topology. Below is an example of creating a simple linear pipeline. ..
code-block:: python from __future__ import print_function from consecution import Pipeline, Node class SimpleNode(Node): def process(self, item): print('{} processing {}'.format(self.name, item)) self.push(item) left = SimpleNode('left') middle = SimpleNode('middle') right = SimpleNode('right') # wire nodes together with bash-like pipe operator node_object = left | middle | right # You can now pass the node object into a pipeline constructor pipe = Pipeline(node_object) pipe.consume(range(2)) In order to create a directed acyclic graph (DAG) you need four basic constructs: * Send data from one node to a single other node * Broadcast data from one node to a set of other nodes * Route data from one node to one of a set of other nodes * Gather output from several nodes into one node. The DSL provides mechanisms for each of these constructs, and we will look at each in turn. Send data from single node to single node ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use simple bash-like pipe syntax to send data from a single node to another node. .. code-block:: python # Send data from one node to a single other node using bash-like piping. node1 | node2 Broadcast data from single node to multiple nodes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Broadcasting is accomplished by piping to a list of nodes. In the following example, ``node1`` will send each item it pushes to ``node2``, ``node3``, and ``node4``. .. code-block:: python # Broadcast to a set of nodes by piping to a list node1 | [node2, node3, node4] Routing from one node to one of multiple nodes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Routing is accomplished by piping to a list that contains a single callable and any number of nodes. The following example will send even numbers to ``even_node`` and odd numbers to ``odd_node``. .. code-block:: python # Define a node class class N(Node): def process(self, item): self.push(item) # Define a routing function.
    # It takes a single argument, the item you pushed, and should
    # return a string with the name of the node to which that item
    # should be routed.
    def route_func(item):
        if item % 2 == 0:
            return 'even_node'
        else:
            return 'odd_node'

    # Pipe to a list of nodes and a callable to achieve routing
    N('top_node') | [N('even_node'), N('odd_node'), route_func]

Gather output from multiple nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gathering output from a set of nodes is as simple as piping a list of nodes
(and possibly a route function) to a single node.  In this example, the
outputs of ``node2``, ``node3``, and ``node4`` will all be sent to
``node5``.

.. code-block:: python

    # Gather the output of a set of nodes by piping the list to a single node
    node1 | [node2, node3, node4] | node5

Pipeline
--------

Once nodes are wired together, they need to be encapsulated into a pipeline
before they can operate on data.  This is done by passing any node in the
network as the argument to the ``Pipeline`` constructor.  On construction,
the pipeline will ensure you have a valid processing graph and will execute
initialization code to ensure that the nodes are efficiently connected.
Immediately after construction, the pipeline is ready to consume data.

Consuming Iterables
~~~~~~~~~~~~~~~~~~~

When the ``.consume(iterable)`` method is called, a sequence of events
occurs in exactly this order.

#. The ``.begin()`` method on the pipeline object is called.  You can
   override this method to perform any task you'd like.
#. The ``.begin()`` methods of all nodes in the network are called.  They
   are called in top-down order, which means that the ``.begin()`` method of
   a node is guaranteed not to be called until the ``.begin()`` methods of
   all its ancestors have been called.
#. Items are read from the iterable argument supplied to the ``.consume()``
   method.  These are fed through the topology of the processing graph one
   by one.  Each item is completely processed by the graph before the next
   one is lifted off the iterable.
#. The ``.end()`` methods of all nodes are called in top-down order.
#. The ``.end()`` method of the pipeline is called.

Manually Feeding a Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to consuming iterables, you can manually feed pipelines using
the ``.push()`` method on the pipeline itself.  When you are finished
pushing items, you can manually call the ``.end()`` method.  Here is an
example.

.. code-block:: python

    from __future__ import print_function
    from consecution import Node, Pipeline

    class N(Node):
        def process(self, item):
            print(item)
            self.push(item)

    pipe = Pipeline(N('first') | N('second'))
    for nn in range(2):
        pipe.push(nn)
    pipe.end()

Pipeline API Documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~

Pipelines support dictionary-like access to their nodes.  Here are examples.

.. code-block:: python

    from consecution import Node, Pipeline

    # Define a node
    class N(Node):
        def process(self, item):
            self.push(item)

    # Create a pipeline with two nodes
    pipe = Pipeline(N('first') | N('second'))

    # Get a reference to a node with dictionary syntax
    first = pipe['first']

    # Replace a node with dictionary-like syntax
    pipe['first'] = N('first')

.. autoclass:: consecution.pipeline.Pipeline
   :members:

GlobalState
-----------

The ``GlobalState`` class is a simple python class that supports both
dictionary-like and object-like attribute access.  An object of this class
will be used as the default ``global_state`` attribute of a pipeline if you
don't explicitly provide one in the constructor.

.. autoclass:: consecution.pipeline.GlobalState
   :members:

================================================
FILE: docs/toc.rst
================================================
Table of Contents
=================

.. toctree::
    :maxdepth: 2

    index
    ref/consecution

================================================
FILE: pandashells.md
================================================
Pandashells One-liner Example
===

Pandashells lets you use Pandas from the bash command line.
It allows you to combine unix command-line tools (awk, grep, sed, etc.) with
the power of Pandas DataFrames and Matplotlib visualization.  Here is a
one-liner that performs the exact same aggregation demonstrated by the
example consecution pipeline.

```bash
cat sample_data.csv | \
p.df 'df["group"] = ["adult" if a>=18 else "child" for a in df.age]' | \
p.df 'df.pivot_table(index="group", columns="gender", values="spent", margins=True, aggfunc=sum).fillna(0)' \
-o table index
```

================================================
FILE: publish.py
================================================
import subprocess

subprocess.call('pip install wheel'.split())
subprocess.call('python setup.py clean --all'.split())
subprocess.call('python setup.py sdist'.split())
# subprocess.call('pip wheel --no-index --no-deps --wheel-dir dist dist/*.tar.gz'.split())
subprocess.call('python setup.py register sdist bdist_wheel upload'.split())

================================================
FILE: sample_data.csv
================================================
gender,age,spent
male,11,39.39
female,10,34.72
female,15,40.02
male,19,26.27
male,13,21.22
female,40,23.17
female,52,33.42
male,33,39.52
female,16,28.65
male,60,26.74

================================================
FILE: setup.cfg
================================================
[nosetests]
nocapture=1
verbosity=1
with-coverage=1
cover-branches=1
#cover-min-percentage=100
cover-package=consecution

[coverage:report]
show_missing=True
fail_under=100
exclude_lines =
    # Have to re-enable the standard pragma
    pragma: no cover
    # Don't complain if tests don't hit defensive assertion code:
    raise NotImplementedError

[coverage:run]
omit =
    consecution/version.py
    consecution/__init__.py

[flake8]
max-line-length = 120
exclude = docs,env,*.egg
max-complexity = 10
ignore = E402

[build_sphinx]
source-dir = docs/
build-dir = docs/_build
all_files = 1

[upload_sphinx]
upload-dir = docs/_build/html

[bdist_wheel]
universal = 1
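For readers who don't have pandashells installed, the aggregation performed by the one-liner in `pandashells.md` above can be reproduced with nothing but the Python standard library. The sketch below is purely illustrative (it inlines the sample rows and skips the pivot-table layout); it is not code from this repository.

```python
import csv
import io
from collections import defaultdict

# Inline copy of sample_data.csv (from this repository)
SAMPLE = """gender,age,spent
male,11,39.39
female,10,34.72
female,15,40.02
male,19,26.27
male,13,21.22
female,40,23.17
female,52,33.42
male,33,39.52
female,16,28.65
male,60,26.74
"""


def aggregate(text):
    """Sum 'spent' grouped by (adult/child, gender)."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(text)):
        group = 'adult' if int(row['age']) >= 18 else 'child'
        totals[(group, row['gender'])] += float(row['spent'])
    return totals


totals = aggregate(SAMPLE)
```

The resulting sums match the cells of the pivot table the one-liner prints (e.g. adult/male is the sum of the three adult male `spent` values).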
================================================
FILE: setup.py
================================================
#!/usr/bin/env python
import io
import os
import re

from setuptools import setup, find_packages

file_dir = os.path.dirname(__file__)


def read(path, encoding='utf-8'):
    path = os.path.join(os.path.dirname(__file__), path)
    with io.open(path, encoding=encoding) as fp:
        return fp.read()


def version(path):
    """Obtain the package version from a python file e.g. pkg/__init__.py

    See .
    """
    version_file = read(path)
    version_match = re.search(
        r"""^__version__ = ['"]([^'"]*)['"]""", version_file, re.M)
    if version_match:
        return version_match.group(1)
    raise RuntimeError("Unable to find version string.")


LONG_DESCRIPTION = """
Consecution is an easy-to-use pipeline abstraction inspired by Apache Storm
topologies.
"""

setup(
    name='consecution',
    version=version(os.path.join(file_dir, 'consecution', '__init__.py')),
    author='Rob deCarvalho',
    author_email='unlisted',
    description=('Pipeline Abstraction Library'),
    license='BSD',
    keywords=('pipeline apache storm DAG graph topology ETL'),
    url='https://github.com/robdmc/consecution',
    packages=find_packages(),
    long_description=LONG_DESCRIPTION,
    classifiers=[
        'Environment :: Console',
        'Intended Audience :: Developers',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3.5',
        'Topic :: Scientific/Engineering',
    ],
    extras_require={'dev': ['nose', 'coverage', 'mock', 'flake8', 'coveralls']},
    install_requires=['graphviz']
)
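The `version()` helper in setup.py extracts `__version__` with a multiline regex instead of importing the package (which would fail before its dependencies are installed). Here is a minimal self-contained sketch of that same pattern; the module body is hypothetical, not taken from the repository.

```python
import re

# Hypothetical contents of a package __init__.py
module_text = '''
"""Example package."""
__version__ = '1.2.3'
'''

# The same regex setup.py uses: anchor to the start of a line (re.M)
# and capture whatever sits between the quotes.
match = re.search(r"""^__version__ = ['"]([^'"]*)['"]""", module_text, re.M)
version = match.group(1) if match else None
```

Reading the version as text keeps `setup.py` importable even when the package's own `install_requires` (here, `graphviz`) are not yet present.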