Repository: robdmc/consecution
Branch: develop
Commit: c23b4ea20fb7
Files: 29
Total size: 119.8 KB
Directory structure:
gitextract_eotr679u/
├── .coveragerc
├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── consecution/
│ ├── .coverage
│ ├── __init__.py
│ ├── nodes.py
│ ├── pipeline.py
│ ├── tests/
│ │ ├── __init__.py
│ │ ├── nodes_tests.py
│ │ ├── pipeline_tests.py
│ │ ├── testing_helpers.py
│ │ └── utils_tests.py
│ └── utils.py
├── docker/
│ ├── Dockerfile
│ ├── docker_build.sh
│ ├── docker_run.sh
│ └── simple_example.py
├── docs/
│ ├── Makefile
│ ├── conf.py
│ ├── index.rst
│ ├── ref/
│ │ └── consecution.rst
│ └── toc.rst
├── pandashells.md
├── publish.py
├── sample_data.csv
├── setup.cfg
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .coveragerc
================================================
[report]
show_missing = True
================================================
FILE: .gitignore
================================================
.DS_Store
*.pyc
================================================
FILE: .travis.yml
================================================
sudo: false
language: python
python:
  - '2.7'
  - '3.4'
  - '3.5'
  - '3.6'
  - '3.7'
install:
  - pip install -e .[dev]
before_script:
  - flake8 .
script:
  - nosetests
  - coverage report --fail-under=100
after_success:
  - coveralls
notifications:
  email: false
addons:
  apt:
    packages:
      - graphviz
================================================
FILE: LICENSE
================================================
Copyright (c) 2015, Robert deCarvalho
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The views and conclusions contained in the software and documentation are those
of the authors and should not be interpreted as representing official policies,
either expressed or implied, of the FreeBSD Project.
================================================
FILE: README.md
================================================
Update (2/23/2021)
===
It looks like this README is slowly turning into a reference of all the projects in this space that I think are better than consecution.
Here is [metaflow](https://github.com/Netflix/metaflow), an offering from Netflix.
Update (9/21/2020)
===
Another library that I believe to be better than consecution is the [pypeln](https://cgarciae.github.io/pypeln/) project. The way it allows for a different number of workers on each node of a pipeline is quite nice. Additionally, the ability to control whether each node runs using threads, processes, async, or sync execution is really useful.
Update (5/1/2020)
===
Since writing this, the excellent [streamz](https://streamz.readthedocs.io/en/latest/) package has been created. Streamz
is the project I wish had existed back when I wrote this. It is a much more capable implementation of the core
ideas of consecution, and plays nicely with [dask](https://dask.org/) to achieve scale. I have started using streamz in my work in place of consecution.
Consecution
===
[](https://travis-ci.org/robdmc/consecution)
[](https://coveralls.io/github/robdmc/consecution?branch=add_docs)
Introduction
---
Consecution is:
* An easy-to-use pipeline abstraction inspired by <a href="http://storm.apache.org/releases/current/Tutorial.html"> Apache Storm Topologies</a>
* Designed to simplify building ETL pipelines that are robust and easy to test
* A system for wiring together simple processing nodes to form a DAG, which is fed with a python iterable
* Built using synchronous, single-threaded execution strategies designed to run efficiently on a single core
* Implemented in pure-python with optional requirements that are needed only for graph visualization
* Written with 100% test coverage
Consecution makes it easy to build systems like this.

Installation
---
Consecution is a pure-python package that is simply installed with pip. The only optional
requirement is the
<a href="http://www.graphviz.org/">Graphviz</a> system package, which is needed only if you want to create
graphical representations of your pipeline.
<pre><code><strong>[~]$ pip install consecution</strong></code></pre>
Docker
---
If you would like to try out consecution on docker, check out consecution from github and navigate to the
`docker/` subdirectory. From there, run the following.
* Build the consecution image: `docker_build.sh`
* Start a container: `docker_run.sh`
* Once in the container, run the example: `python simple_example.py`
Quick Start
---
What follows is a quick tour of consecution. See the <a
href="http://consecution.readthedocs.io/en/latest/">API documentation</a> for
more detailed information.
### Nodes
Consecution works by wiring together nodes. You create nodes by inheriting from the
`consecution.Node` class. Every node must define a `.process()` method. This method
contains whatever logic you want for processing single items as they pass through your
pipeline. Here is an example of a node that simply logs items passing through it.
```python
from consecution import Node

class LogNode(Node):
    def process(self, item):
        # any logic you want for processing a single item
        print('{: >15} processing {}'.format(self.name, item))
        # send item downstream
        self.push(item)
```
### Pipelines
Now let's create a pipeline that wires together a series of these logging nodes.
We do this by employing the pipe symbol in much the same way that you pipe data
between programs in unix. Note that you must name nodes when you instantiate
them.
```python
from consecution import Node, Pipeline

# This is the same node class we defined above
class LogNode(Node):
    def process(self, item):
        print('{} processing {}'.format(self.name, item))
        self.push(item)

# Connect nodes with pipe symbols to create a pipeline for consuming any iterable.
pipe = Pipeline(
    LogNode('extract') | LogNode('transform') | LogNode('load')
)
```
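The pipe syntax works because nodes overload Python's `|` operator. Here is a minimal, self-contained sketch of the idea (a toy illustration, not consecution's actual implementation):

```python
class MiniNode:
    """Toy node that records its downstream connections."""
    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __or__(self, other):
        # 'a | b' connects b downstream of a; returning the right-hand
        # node lets chains like 'a | b | c' read left to right
        self.downstream.append(other)
        return other

extract, transform, load = (
    MiniNode('extract'), MiniNode('transform'), MiniNode('load')
)
extract | transform | load
print(extract.downstream[0].name)  # transform
print(transform.downstream[0].name)  # load
```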
At this point, we can visualize the pipeline to verify that the topology is
what we expect it to be. If you have Graphviz installed, you can now simply type
one of the following to see the pipeline visualized.
```python
# Create a pipeline.png file in your working directory
pipe.plot()
# Interactively display the pipeline visualization in an IPython notebook
# by simply making the final expression in a cell evaluate to a pipeline.
pipe
```
The plot command should produce the following visualization.

If you don't have Graphviz installed, you can print the pipeline
object to get a text-based visualization.
```python
print(pipe)
```
This represents your pipeline as a series of pipe statements showing
how data is piped between nodes.
```
Pipeline
--------------------------------------------------------------------
extract | transform
transform | load
--------------------------------------------------------------------
```
We can now process an iterable with our pipeline by running
```python
pipe.consume(range(3))
```
which will print the following to the console.
```
extract processing 0
transform processing 0
load processing 0
extract processing 1
transform processing 1
load processing 1
extract processing 2
transform processing 2
load processing 2
```
### Broadcasting
Piping the output of a single node into a list of nodes will cause the single
node to broadcast its pushed items to every node in the list. So, again, using
our logging node, we could construct a pipeline like this:
```python
from consecution import Node, Pipeline

class LogNode(Node):
    def process(self, item):
        print('{} processing {}'.format(self.name, item))
        self.push(item)

# pipe to a list of nodes to broadcast items
pipe = Pipeline(
    LogNode('extract')
    | LogNode('transform')
    | [LogNode('load_redis'), LogNode('load_postgres'), LogNode('load_mongo')]
)
pipe.plot()
pipe.consume(range(2))
```
The plot command produces this visualization

and consuming `range(2)` produces this output
```
extract processing 0
transform processing 0
load_redis processing 0
load_postgres processing 0
load_mongo processing 0
extract processing 1
transform processing 1
load_redis processing 1
load_postgres processing 1
load_mongo processing 1
```
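Conceptually, a broadcasting node just loops over its downstream receivers. A plain-Python sketch (not consecution internals) of the behavior above:

```python
def broadcast(item, receivers):
    # a broadcasting node pushes every item to all downstream receivers
    for receive in receivers:
        receive(item)

seen = {'load_redis': [], 'load_postgres': [], 'load_mongo': []}
for item in range(2):
    broadcast(item, [records.append for records in seen.values()])

print(seen)
# {'load_redis': [0, 1], 'load_postgres': [0, 1], 'load_mongo': [0, 1]}
```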
### Routing
If you pipe to a list that contains multiple nodes and a single callable, then
consecution will interpret the callable as a routing function that accepts a
single item as its only argument and returns the name of one of the nodes in the
list. The routing function will direct the flow of items as illustrated below.
```python
from consecution import Node, Pipeline

class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)

def parity(item):
    if item % 2 == 0:
        return 'transform_even'
    else:
        return 'transform_odd'

# pipe to a list containing a callable to achieve routing behaviour
pipe = Pipeline(
    LogNode('extract')
    | [LogNode('transform_even'), LogNode('transform_odd'), parity]
)
pipe.plot()
pipe.consume(range(4))
```
The plot command produces the following pipeline

and consuming `range(4)` produces this output
```
extract processing 0
transform_even processing 0
extract processing 1
transform_odd processing 1
extract processing 2
transform_even processing 2
extract processing 3
transform_odd processing 3
```
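Conceptually, routing reduces to a name lookup among the downstream nodes. A plain-Python sketch (not consecution internals) of the same routing:

```python
def parity(item):
    # same routing function as above: returns the name of a downstream node
    return 'transform_even' if item % 2 == 0 else 'transform_odd'

received = {'transform_even': [], 'transform_odd': []}
for item in range(4):
    # the router picks which downstream node receives each item
    received[parity(item)].append(item)

print(received)  # {'transform_even': [0, 2], 'transform_odd': [1, 3]}
```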
### Merging
Up to this point, we have the ability to create processing trees where nodes
can either broadcast to or route between their downstream nodes. We can,
however, do more than this and create DAGs (Directed Acyclic Graphs). Piping
from a list back to a single node will merge the output of all nodes in the
list together into the single downstream node like this.
```python
from consecution import Node, Pipeline

class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)

def parity(item):
    if item % 2 == 0:
        return 'transform_even'
    else:
        return 'transform_odd'

# piping from a list back to a single node merges items into downstream node
pipe = Pipeline(
    LogNode('extract')
    | [LogNode('transform_even'), LogNode('transform_odd'), parity]
    | LogNode('load')
)
pipe.plot()
pipe.consume(range(4))
```
The plot command produces the following pipeline

and consuming `range(4)` produces this output
```
extract processing 0
transform_even processing 0
load processing 0
extract processing 1
transform_odd processing 1
load processing 1
extract processing 2
transform_even processing 2
load processing 2
extract processing 3
transform_odd processing 3
load processing 3
```
### Managing Local State
Nodes are classes, and as such, you have the freedom to create any attribute you
want on a node. You can actually define two additional methods on your nodes to
set up and tear down node-local state. It is important to note the order of
execution here. All nodes in a pipeline will execute their `.begin()` methods
in pipeline-order before any items are processed. Each node will enter its
`.end()` method only after it has processed all items, and after all parent
nodes have finished their respective `.end()` methods. Below, we've modified
our LogNode to keep a running sum of all items that pass through it and end by
printing their sum.
```python
from consecution import Node, Pipeline

class LogNode(Node):
    def begin(self):
        self.sum = 0
        print('{}.begin()'.format(self.name))

    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.sum += item
        self.push(item)

    def end(self):
        print('sum = {:d} in {}.end()'.format(self.sum, self.name))

# Same routing function as in the merge example above
def parity(item):
    if item % 2 == 0:
        return 'transform_even'
    else:
        return 'transform_odd'

# Identical pipeline to merge example above, but with modified LogNode
pipe = Pipeline(
    LogNode('extract')
    | [LogNode('transform_even'), LogNode('transform_odd'), parity]
    | LogNode('load')
)
pipe.consume(range(4))
```
Consuming `range(4)` produces the following output
```
extract.begin()
transform_even.begin()
transform_odd.begin()
load.begin()
extract processing 0
transform_even processing 0
load processing 0
extract processing 1
transform_odd processing 1
load processing 1
extract processing 2
transform_even processing 2
load processing 2
extract processing 3
transform_odd processing 3
load processing 3
sum = 6 in extract.end()
sum = 2 in transform_even.end()
sum = 4 in transform_odd.end()
sum = 6 in load.end()
```
### Managing Global State
Every node object has a `.global_state` attribute that is shared globally across
all nodes in the pipeline. The attribute is also available on the Pipeline
object itself. The GlobalState object is a simple mutable python object whose
attributes can be mutated by any node. It also remains accessible on the
Pipeline object after all nodes have completed. Below is a simple example of
mutating and accessing global state.
```python
from consecution import Node, Pipeline, GlobalState

class LogNode(Node):
    def process(self, item):
        self.global_state.messages.append(
            '{: >15} processing {}'.format(self.name, item)
        )
        self.push(item)

# create a global state object with a messages attribute
global_state = GlobalState(messages=[])

# Assign the predefined global_state to the pipeline
pipe = Pipeline(
    LogNode('extract') | LogNode('transform') | LogNode('load'),
    global_state=global_state
)
pipe.consume(range(3))

# print the content of the global state message list
for msg in pipe.global_state.messages:
    print(msg)
```
Printing the contents of the messages list produces
```
extract processing 0
transform processing 0
load processing 0
extract processing 1
transform processing 1
load processing 1
extract processing 2
transform processing 2
load processing 2
```
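The sharing works by plain object reference. A toy stand-in (`MiniGlobalState` is a made-up name used only to illustrate the behavior described above, not consecution's class) might look like:

```python
class MiniGlobalState:
    """Toy stand-in for global state: a namespace initialized
    from keyword arguments and shared by reference."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

state = MiniGlobalState(messages=[])

# every "node" holds a reference to the same object, so a mutation
# made in one place is immediately visible everywhere else
node_a_view, node_b_view = state, state
node_a_view.messages.append('seen by a')
node_b_view.messages.append('seen by b')
print(state.messages)  # ['seen by a', 'seen by b']
```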
## Common Patterns
This section shows examples of how to implement some common patterns in
consecution.
### Map
Mapping with nodes is very simple. Just push an altered item downstream.
```python
from consecution import Node, Pipeline

class Mapper(Node):
    def process(self, item):
        self.push(2 * item)

class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)

pipe = Pipeline(
    LogNode('extractor') | Mapper('mapper') | LogNode('loader')
)
pipe.consume(range(3))
```
This will produce an output of
```
extractor processing 0
loader processing 0
extractor processing 1
loader processing 2
extractor processing 2
loader processing 4
```
### Reduce
Reducing, or folding, is easily implemented by using the `.begin()`
and `.end()` methods to handle accumulated values.
```python
from consecution import Node, Pipeline

class Reducer(Node):
    def begin(self):
        self.result = 0

    def process(self, item):
        self.result += item

    def end(self):
        self.push(self.result)

class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)

pipe = Pipeline(
    LogNode('extractor') | Reducer('reducer') | LogNode('loader')
)
pipe.consume(range(3))
```
This will produce an output of
```
extractor processing 0
extractor processing 1
extractor processing 2
loader processing 3
```
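For comparison, the same fold expressed with Python's built-in reduce: `.begin()` plays the role of the initializer and `.process()` the step function.

```python
from functools import reduce
import operator

# fold range(3) with addition, starting from 0
result = reduce(operator.add, range(3), 0)
print(result)  # 3
```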
### Filter
Filtering is as simple as placing the push statement behind a conditional. Items
that don't pass the conditional are not pushed downstream and are thus
silently dropped.
```python
from consecution import Node, Pipeline

class Filter(Node):
    def process(self, item):
        if item > 3:
            self.push(item)

class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)

pipe = Pipeline(
    LogNode('extractor') | Filter('filter') | LogNode('loader')
)
pipe.consume(range(6))
```
This produces an output of
```
extractor processing 0
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
loader processing 4
extractor processing 5
loader processing 5
```
### Group By
Consecution provides a specialized class you can inherit from to perform
grouping operations. GroupBy nodes must define two methods: `.key(item)` and
`.process(batch)`. The `.key` method should return a key from an item that is used
to identify groups. Any time that key changes, a new group is initiated. Like
Python's `itertools.groupby`, you will usually want the GroupByNode to process
sorted items. The `.process` method functions exactly like the `.process`
method on regular nodes, except that instead of being called with items,
consecution will call it with a batch of items contained in a list.
```python
from consecution import Node, GroupByNode, Pipeline

class LogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)

class Batcher(GroupByNode):
    def key(self, item):
        return item // 4

    def process(self, batch):
        sum_val = sum(batch)
        self.push(sum_val)

pipe = Pipeline(
    Batcher('batcher') | LogNode('logger')
)
pipe.consume(range(16))
```
This produces an output of
```
logger processing 6
logger processing 22
logger processing 38
logger processing 54
```
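The batching behavior mirrors Python's `itertools.groupby`: a new group starts whenever the key value changes, which is why sorted input usually matters. The same sums in plain Python:

```python
from itertools import groupby

# item // 4 changes value every four consecutive items, so each
# group holds one batch of four
batches = [list(group) for _, group in groupby(range(16), key=lambda x: x // 4)]
sums = [sum(batch) for batch in batches]
print(sums)  # [6, 22, 38, 54]
```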
### Plugin-Style Composition
Consecution forces you to think about problems in terms of how small processing
units are connected. This separation between logic and connectivity can be
exploited to create flexible and reusable solutions. Basically, you specify the
connectivity you want to use in solving your problem, and then plug in the
processing units later. Breaking the problem up in this way allows you to swap
out processing units to acheive different objectives with the same pipeline.
```python
from consecution import Node, Pipeline

# This function defines a pipeline that can use swappable processing nodes.
# We don't worry about how we are going to do logging or aggregating.
# We just focus on how the nodes are connected.
def pipeline_factory(log_node, agg_node):
    pipe = Pipeline(
        log_node('extractor') | agg_node('aggregator') | log_node('result_logger')
    )
    return pipe

# Now we define a node for left-justified logging
class LeftLogNode(Node):
    def process(self, item):
        print('{: <15} processing {}'.format(self.name, item))
        self.push(item)

# And one for right-justified logging
class RightLogNode(Node):
    def process(self, item):
        print('{: >15} processing {}'.format(self.name, item))
        self.push(item)

# We can aggregate by summing
class SumNode(Node):
    def begin(self):
        self.result = 0

    def process(self, item):
        self.result += item

    def end(self):
        self.push(self.result)

# Or we can aggregate by multiplying
class ProdNode(Node):
    def begin(self):
        self.result = 1

    def process(self, item):
        self.result *= item

    def end(self):
        self.push(self.result)

# Now we plug in nodes to create a pipeline that left-prints sums
sum_pipeline = pipeline_factory(log_node=LeftLogNode, agg_node=SumNode)

# And a different pipeline that right prints products
prod_pipeline = pipeline_factory(log_node=RightLogNode, agg_node=ProdNode)

print('aggregate with sum, left justified\n' + '-' * 40)
sum_pipeline.consume(range(1, 5))

print('\naggregate with product, right justified\n' + '-' * 40)
prod_pipeline.consume(range(1, 5))
```
This produces the following output
```
aggregate with sum, left justified
----------------------------------------
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
result_logger processing 10
aggregate with product, right justified
----------------------------------------
extractor processing 1
extractor processing 2
extractor processing 3
extractor processing 4
result_logger processing 24
```
# Aggregation Example
We end with a full-blown example of using a pipeline to aggregate data from a
csv file. The data is contained in
<a href="https://raw.githubusercontent.com/robdmc/consecution/master/sample_data.csv">
a csv file </a> that looks like this.
gender |age |spent
--- |--- |---
male |11 |39.39
female |10 |34.72
female |15 |40.02
male |19 |26.27
male |13 |21.22
female |40 |23.17
female |52 |33.42
male |33 |39.52
female |16 |28.65
male |60 |26.74
Although there are much simpler ways of solving this problem (e.g. with <a
href="https://github.com/robdmc/consecution/blob/master/pandashells.md">
Pandashells</a>)
we deliberately construct a complex topology just to illustrate how to achieve
complexity when it is actually needed.
The diagram below was produced from the code beneath it. A quick glance at the
diagram makes it obvious how the data is being routed through the system. The
code is heavily commented to explain features of the consecution toolkit.

```python
from __future__ import print_function
from collections import namedtuple
from pprint import pprint
import csv

from consecution import Node, Pipeline, GlobalState

# Named tuples are nice immutable containers
# for passing data between nodes
Person = namedtuple('Person', 'gender age spent')

# Create a pipeline that aggregates by gender and age.
# In creating the pipeline we focus on connectivity and don't
# worry about defining node behavior.
def pipe_factory(Extractor, Agg, gender_router, age_router):
    # Consecution provides a generic GlobalState class. Any object can be used
    # as the global_state in a pipeline, but the GlobalState object provides a
    # nice abstraction where attributes can be accessed either by dot notation
    # (e.g. global_state.my_attribute) or by dictionary notation (e.g.
    # global_state['my_attribute']). Furthermore, GlobalState objects can be
    # instantiated with initialized attributes using keyword arguments as
    # shown here.
    global_state = GlobalState(segment_totals={})

    # Notice, we haven't even defined the behavior of these nodes yet. They
    # will be defined later and are, for now, just passed into the factory
    # function as arguments while we focus on getting the topology right.
    pipe = Pipeline(
        Extractor('make_person') |
        [
            gender_router,
            (Agg('male') | [age_router, Agg('male_child'), Agg('male_adult')]),
            (Agg('female') | [age_router, Agg('female_child'), Agg('female_adult')]),
        ],
        global_state=global_state
    )

    # Nodes can be created outside of a pipeline definition
    adult = Agg('adult')
    child = Agg('child')
    total = Agg('total')

    # Sometimes the topology you want to create cannot easily be expressed
    # using the pipeline abstraction for wiring nodes together. You can
    # drop down to a lower level of abstraction by explicitly wiring nodes
    # together using the .add_downstream() method.
    adult.add_downstream(total)
    child.add_downstream(total)

    # Once a pipeline has been created, you can access individual nodes
    # with dictionary-like indexing on the pipeline.
    pipe['male_child'].add_downstream(child)
    pipe['female_child'].add_downstream(child)
    pipe['male_adult'].add_downstream(adult)
    pipe['female_adult'].add_downstream(adult)
    return pipe

# Now that we have the topology of our pipeline defined, we can think about the
# logic that needs to go into each node. We start by defining a node that takes
# a row from a csv file and transforms it into a namedtuple.
class MakePerson(Node):
    def process(self, item):
        item['age'] = int(item['age'])
        item['spent'] = float(item['spent'])
        self.push(Person(**item))

# We now define a node to perform our aggregations. Mutable global state comes
# with a lot of baggage and should be used with care. This node illustrates
# how to use global state to put all aggregations in a central location that
# remains accessible when the pipeline finishes processing.
class Sum(Node):
    def begin(self):
        # initialize the node-local sum to zero
        self.total = 0

    def process(self, item):
        # increment the node-local total and push the item downstream
        self.total += item.spent
        self.push(item)

    def end(self):
        # when pipeline is done, update global state with sum
        self.global_state.segment_totals[self.name] = round(self.total, 2)

# This function routes tuples based on their associated gender
def by_gender(item):
    return '{}'.format(item.gender)

# This function routes tuples based on whether the purchaser was an adult or
# child
def by_age(item):
    if item.age >= 18:
        return '{}_adult'.format(item.gender)
    else:
        return '{}_child'.format(item.gender)

# Here we plug our node definitions into our topology to create a fully-defined
# pipeline.
pipe = pipe_factory(MakePerson, Sum, by_gender, by_age)

# We can now visualize the pipeline.
pipe.plot()

# Now we feed our pipeline with rows from the csv file
with open('sample_data.csv') as f:
    pipe.consume(csv.DictReader(f))

# The global_state is also available as an attribute on the pipeline allowing
# us to access it when the pipeline is finished. This is a good way to "return"
# an object from a pipeline. Here we simply print the result.
print()
pprint(pipe.global_state.segment_totals)
```
And this is the result of running the pipeline with the sample csv file.
```
{'adult': 149.12,
'child': 164.0,
'female': 159.98,
'female_adult': 56.59,
'female_child': 103.39,
'male': 153.14,
'male_adult': 92.53,
'male_child': 60.61,
'total': 313.12}
```
As illustrated in the <a
href="https://github.com/robdmc/consecution/blob/master/pandashells.md">
Pandashells</a> example, this aggregation is actually much simpler to
implement in Pandas. However, there are a couple of important caveats.
The Pandas solution must load the entire csv file into memory at once. If you
look at the pipeline solution, you will notice that each node simply increments
its local sum and passes the data downstream. At no point is the data
completely loaded into memory. Although the Pandas code runs much faster due to
the highly optimized vectorized math it employs, the pipeline solution can
process arbitrarily large csv files with a very small memory footprint.
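The streaming behavior comes from `.consume()` accepting any iterator. A stand-alone illustration, using a small hypothetical inline csv, of why `csv.DictReader` never holds more than one row in memory:

```python
import csv
import io

# hypothetical two-row stand-in for a csv file on disk
csv_text = 'gender,age,spent\nmale,11,39.39\nfemale,10,34.72\n'

total = 0.0
# DictReader is an iterator: each loop iteration parses exactly one
# row, so the running total grows without loading the whole file
for row in csv.DictReader(io.StringIO(csv_text)):
    total += float(row['spent'])

print(round(total, 2))  # 74.11
```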
Perhaps the most exciting aspect of consecution is its ability to create
repeatable and testable data analysis pipelines. Passing Pandas Dataframes
through a consecution pipeline makes it very easy to encapsulate any analysis
into a well-defined, repeatable process where each node manipulates a dataframe
in its prescribed way. Adopting this structure in analysis projects will
undoubtedly ease the transition from analysis/research into production.
___
Projects by [robdmc](https://www.linkedin.com/in/robdecarvalho).
* [Pandashells](https://github.com/robdmc/pandashells) Pandas at the bash command line
* [Consecution](https://github.com/robdmc/consecution) Pipeline abstraction for Python
* [Behold](https://github.com/robdmc/behold) Helping debug large Python projects
* [Crontabs](https://github.com/robdmc/crontabs) Simple scheduling library for Python scripts
* [Switchenv](https://github.com/robdmc/switchenv) Manager for bash environments
* [Gistfinder](https://github.com/robdmc/gistfinder) Fuzzy-search your gists
================================================
FILE: consecution/__init__.py
================================================
# flake8: noqa
from consecution.nodes import Node, GroupByNode
from consecution.pipeline import Pipeline, GlobalState
from consecution.utils import Clock
__version__ = '0.2.0'
================================================
FILE: consecution/nodes.py
================================================
import sys
from collections import Counter, deque, OrderedDict
import traceback
from consecution.utils import Clock
class Node(object):
"""
:type name: str
:param str: The name of this node. Must be unique within a pipeline.
:type kwargs: keyword args
:param kwargs: Any additional keyword args are assigned as attributes
on the node.
You create nodes by inheriting from this class. You will be required to
implement a `.process()` on your class. You can call the `.push()` method
from anywhere in your class implementation except from within the
`.begin()` method.
Note that although this documentation refers to "the `.push` method",
`push` is actually a callable attribute assigned when nodes are placed
into pipelines.
Its signature is `.push(item)`, where `item` can be anything you want pushed
to nodes connected to the downstream side of the node.
"""
def __init__(self, name, **kwargs):
# assign any user-defined attributes
for k, v in kwargs.items():
setattr(self, k, v)
self.name = name
self._upstream_nodes = []
self._downstream_nodes = []
self._num_top_down_calls = 0
# node network can be visualized with pydot. These hold args and kwargs
# that will be used to add and connect this node in the graph visualization
self._pydot_node_kwargs = dict(name=self.name, shape='rectangle')
self._pydot_edge_kwarg_list = []
self._router = None
# this will be one of three values: None, 'input', 'output'
self._logging = None
# add a clock to allow for timing
self.clock = Clock()
def __str__(self):
return 'N({})'.format(self.name)
def __repr__(self):
return self.__str__()
def __hash__(self):
"""
define __hash__ method. dicts and sets will use this as key
"""
return id(self)
def __eq__(self, other):
return self.__hash__() == other.__hash__()
def __lt__(self, other):
"""
I need this to be able to sort by name
"""
return self.name < other.name
def __getitem__(self, key):
msg = (
'\n\nYou cannot call __getitem__ on nodes. You tried to call\n'
'{self} [{key}]\n'
'which doesn\'t make sense. You probably meant\n'
'{self} | [{key}]\n'
).format(self=self, key=key)
raise ValueError(msg)
def _get_flattened_list(self, obj):
if isinstance(obj, Node):
return [obj]
elif hasattr(obj, '__iter__'):
nodes = []
for el in obj:
if isinstance(el, Node):
nodes.append(el)
elif hasattr(el, '__iter__'):
nodes.extend(self._get_flattened_list(el))
return nodes
else:
msg = (
'Don\'t know what to do with {}. It\'s not a node, and it\'s '
'not iterable.'
).format(repr(obj))
raise ValueError(msg)
def _get_exposed_slots(self, obj, pointing):
nodes = set()
for node in self._get_flattened_list(obj):
if pointing == 'left':
nodes = nodes.union(node.initial_node_set)
elif pointing == 'right':
nodes = nodes.union(node.terminal_node_set)
else:
raise ValueError('pointing must be "left" or "right"')
return nodes
def _connect_lefts_to_rights(self, lefts, rights, router=None):
slots_from_left = self._get_exposed_slots(lefts, pointing='right')
slots_from_right = self._get_exposed_slots(rights, pointing='left')
for left in slots_from_left:
router_node = None
if router:
router_name = '{}.{}'.format(
left.name, self._get_object_name(router))
end_point_map = {n.name: n for n in slots_from_right}
router_node = _RouterNode(
router_name, end_point_map, router)
left.add_downstream(router_node)
for right in slots_from_right:
if router_node:
router_node.add_downstream(right)
else:
left.add_downstream(right)
def _get_object_name(self, obj):
class_name = obj.__class__.__name__
if class_name == 'function':
return obj.__name__
else:
return class_name
def _get_router(self, obj):
router = None
if hasattr(obj, '__iter__'):
routers = [el for el in obj if hasattr(el, '__call__')]
router = routers[0] if routers else None
return router
def __or__(self, other):
router = self._get_router(other)
self._connect_lefts_to_rights(self, other, router)
return self
def __ror__(self, other):
self._connect_lefts_to_rights(other, self)
return self
@property
def top_node(self):
"""
This attribute always holds the top-most node in the node graph.
Consecution only allows one top node.
"""
root_nodes = self.root_nodes
if len(root_nodes) > 1:
msg = 'You must remove one of the following input nodes {}'.format(
root_nodes)
raise ValueError(msg)
else:
return root_nodes.pop()
@property
def terminal_node_set(self):
"""
This attribute holds a set of all bottom nodes in the node graph.
"""
return {
node for node in self.depth_first_walk('down')
if len(node._downstream_nodes) == 0
}
@property
def initial_node_set(self):
"""
When piecing together fragments of a graph, you can temporarily have
connected nodes with multiple "top-nodes." This method returns this
set of nodes. Note that consecution can only make pipelines from
graphs having a single top node.
"""
return {
node for node in self.depth_first_walk('up')
if len(node._upstream_nodes) == 0
}
@property
def root_nodes(self):
"""
This attribute holds a list of all nodes that do not have any upstream
nodes attached.
"""
return [
node for node in self.all_nodes
if len(node._upstream_nodes) == 0
]
@property
def all_nodes(self):
"""
This attribute contains a set of all nodes in the graph.
"""
return self.depth_first_walk('both')
def log(self, what):
"""
Calling this method on a node will turn on its logging feature. This
means that the node will print logged items to the console. You can
choose whether to log the inputs or outputs of a node.
:type what: str
:param what: One of 'input' or 'output' indicating whether you want to
log the input or output of this node.
"""
allowed = ['input', 'output']
if what not in allowed:
raise ValueError(
'\'what\' argument must be in {}'.format(allowed)
)
self._logging = what
def _get_downstream_reps(self):
if self._downstream_nodes:
downstreams = sorted([n.name for n in self._downstream_nodes])
if len(downstreams) == 1:
downstreams = downstreams[0]
template = '{{: >{}s}} | {{}}\n'.format(
self.pipeline._longest_node_name_len_)
self.pipeline._node_repr += template.format(
self.name, downstreams).replace('\'', '')
def top_down_make_repr(self):
"""
You should never need to use this method. It iterates through the node
graph in top-down order making a repr string for each node.
"""
if not hasattr(self, 'pipeline'):
raise ValueError(
'top_down_make_repr can only be called for nodes in a pipeline')
self.pipeline._longest_node_name_len_ = max(
len(n.name) for n in self.all_nodes)
self.pipeline._node_repr = ''
self.top_node.top_down_call('_get_downstream_reps')
def top_down_call(self, method_name):
"""
This utility method traverses the graph in top-down order and invokes
the named method on every node it encounters. It is used internally
to make sure the `.begin()` and `.end()` methods are not called before
their upstream counterparts.
:type method_name: str
:param method_name: The name of the method you would like to call in
top-down order.
"""
# record the number of upstreams this node has
num_upstreams = len(self._upstream_nodes)
# if this node isn't pulling from multiple upstreams, it's ready
# to recurse to downstreams
if num_upstreams <= 1:
ready_for_downstreams = True
# a node with multiple upstreams isn't ready to recurse to its
# downstreams until the current call is the last required call
elif self._num_top_down_calls == num_upstreams - 1:
ready_for_downstreams = True
else:
ready_for_downstreams = False
# if ready to recurse, then call the method on self and recurse
# downwards.
if ready_for_downstreams:
getattr(self, method_name)()
for downstream in self._downstream_nodes:
downstream.top_down_call(method_name)
self._num_top_down_calls = 0
else:
self._num_top_down_calls += 1
def depth_first_walk(self, direction='both', as_ordered_list=False):
"""
This method walks the graph of connected nodes in depth-first
order. It uses a stack to emulate recursion. See a good explanation at
https://jeremykun.com/2013/01/22/depth-and-breadth-first-search/
:type direction: str
:param direction: one of 'up', 'down' or 'both' specifying the direction
to walk.
:type as_ordered_list: Bool
:param as_ordered_list: If set to true, returns the walked nodes as
an ordered list instead of an unordered set.
:rtype: list or set
:return: An iterable of the discovered nodes.
"""
return self.walk(
direction=direction, how='depth_first',
as_ordered_list=as_ordered_list)
def breadth_first_walk(self, direction='both', as_ordered_list=False):
"""
This method walks the graph of connected nodes in breadth-first
order. It uses a queue to avoid recursion. See a good explanation at
https://jeremykun.com/2013/01/22/depth-and-breadth-first-search/
:type direction: str
:param direction: one of 'up', 'down' or 'both' specifying the direction
to walk.
:type as_ordered_list: Bool
:param as_ordered_list: If set to true, returns the walked nodes as
an ordered list instead of an unordered set.
:rtype: list or set
:return: An iterable of the discovered nodes.
"""
return self.walk(
direction=direction, how='breadth_first',
as_ordered_list=as_ordered_list)
def walk(
self, direction='both', how='breadth_first', as_ordered_list=False):
"""
This is the core algorithm for walking a graph in specified order. It
is used by the `breadth_first_walk` and `depth_first_walk` methods.
:type how: str
:param how: one of 'breadth_first' or 'depth_first'
:type direction: str
:param direction: one of 'up', 'down' or 'both' specifying the direction
to walk.
:type as_ordered_list: Bool
:param as_ordered_list: If set to true, returns the walked nodes as
an ordered list instead of an unordered set.
:rtype: list or set
:return: An iterable of the discovered nodes.
"""
if how not in {'depth_first', 'breadth_first'}:
raise ValueError(
'\'how\' argument must be one of '
'[\'depth_first\', \'breadth_first\']'
)
# What I really want is an ordered set, which doesn't exist. So I'm
# using the keys of an ordered dict to get the functionality I want.
# I have no need for the values in this dict, only the keys.
visited_nodes = OrderedDict()
# holds nodes that still need to be explored
queue = deque([self])
# while I still have nodes that need exploring
while len(queue) > 0:
# get the next node to explore
node = queue.pop()
# if I've already seen this node, nothing to do, so go to next
if node in visited_nodes:
continue
# Mark this node visited so I don't process it
# again. I'm using an ordered dict to mimic an ordered set.
# I have no need for the value, so set it to None
visited_nodes[node] = None
neighbor_dict = {
'up': node._upstream_nodes,
'down': node._downstream_nodes,
'both': node._upstream_nodes + node._downstream_nodes,
}
if direction not in neighbor_dict:
raise ValueError(
'direction must be \'up\', \'down\' or \'both\'')
neighbors = neighbor_dict[direction]
# search all neighbors of this node for unvisited nodes
for node in neighbors:
# if you find unvisited node, add it to nodes needing visit
if node not in visited_nodes:
if how == 'breadth_first':
queue.appendleft(node)
else:
queue.append(node)
# should have hit all nodes in the graph at this point
if as_ordered_list:
return list(visited_nodes.keys())
else:
return set(visited_nodes.keys())
def _check_for_dups(self):
counter = Counter()
for node in self.all_nodes:
counter.update({node.name: 1})
dups = [name for (name, count) in counter.items() if count > 1]
if dups:
msg = (
'\n\nNode names must be unique. Duplicates {} found.'
).format(list(dups))
raise ValueError(msg)
return
def _check_for_cycles(self):
self_and_upstreams = self.depth_first_walk('up')
downstreams = self.depth_first_walk('down') - {self}
common_nodes = self_and_upstreams.intersection(downstreams)
if common_nodes:
raise ValueError('\n\nYour graph is not acyclic. It has loops.')
def _validate_node(self, other):
# only nodes allowed to be connected
if not isinstance(other, Node):
raise ValueError('Trying to connect a non-node type')
def add_downstream(self, other):
"""
You will probably use this method quite a bit. It is used to manually
attach a downstream node.
:type other: consecution.Node
:param other: An instance of the node you want to attach
"""
self._validate_node(other)
self._downstream_nodes.append(other)
other._upstream_nodes.append(self)
self._check_for_dups()
if self.name == other.name:
raise ValueError('{} can\'t be downstream to itself'.format(self))
self._check_for_cycles()
self._pydot_edge_kwarg_list.append(
dict(tail_name=self.name, head_name=other.name))
def remove_downstream(self, other):
"""
This method removes the given node from being attached as a downstream
node.
:type other: consecution.Node
:param other: An instance of the node you want to remove
"""
# remove self from the other's upstreams
other._upstream_nodes = [
n for n in other._upstream_nodes if n.name != self.name]
# remove other from self's downstream nodes
self._downstream_nodes = [
n for n in self._downstream_nodes if n.name != other.name]
# remove this connection from the pydot kwargs list
new_kwargs_list = []
for kwargs in self._pydot_edge_kwarg_list:
if kwargs['head_name'] == other.name:
continue
new_kwargs_list.append(kwargs)
self._pydot_edge_kwarg_list = new_kwargs_list
def _build_pydot_graph(self):
"""
This private method builds a graphviz Digraph used for visualization
"""
# define kwargs lists for creating the visualization (these are closure vars for function below)
node_kwargs_list, edge_kwargs_list = [], []
# define a function to map over all nodes to aggregate viz kwargs
def collect_kwargs(node):
node_kwargs_list.append(node._pydot_node_kwargs)
edge_kwargs_list.extend(node._pydot_edge_kwarg_list)
for node in self.all_nodes:
collect_kwargs(node)
# doing import inside method so that graphviz dependency is optional
from graphviz import Digraph
# create a graphviz graph
graph = Digraph(comment='pipeline')
# create graphviz nodes for every node connected to this one
for node_kwargs in node_kwargs_list:
graph.node(**node_kwargs)
# create graphviz edges between all nodes connected to this one
for edge_kwargs in edge_kwargs_list:
graph.edge(**edge_kwargs)
return graph
def plot(
self, file_name='pipeline', kind='png'):
"""
This method draws a visualization of your processing graph. You must
have graphviz installed on your system for it to work properly. (See
install instructions.)
If you are running consecution in a Jupyter notebook, you can display
an inline visualization of a pipeline by simply making the pipeline
the final expression in a cell.
:type file_name: str
:param file_name: The name of the image file to generate
:type kind: str
:param kind: The kind of file to generate (png, pdf)
"""
graph = self._build_pydot_graph()
# define allowed formats for saving the graph visualization
ALLOWED_KINDS = {'pdf', 'png'}
if kind not in ALLOWED_KINDS:
raise ValueError('Only the following kinds are supported: {}'.format(ALLOWED_KINDS))
# set the output format
graph.format = kind
file_name = file_name.replace('.{}'.format(kind), '')
# write the output file
try:
graph.render(file_name)
except RuntimeError:
sys.stderr.write(
'\n\n'
'=========================================================\n'
'Problem executing GraphViz. Make sure you have it\n'
'properly installed.\n'
'http://www.graphviz.org/\n'
'If you are on a mac, you should be able to install it with\n'
'brew install graphviz.\n\n'
'If you are on ubuntu, you can install it with\n'
'apt-get install graphviz\n'
'=========================================================\n'
'\n\n'
)
raise
def process(self, item):
"""
:type item: object
:param item: The item this node should process
You must override this method with your own logic.
"""
raise NotImplementedError(
(
'Error in node named {}\n'
'You must define a .process(self, item) method on all nodes'
).format(repr(self.name))
)
def reset(self):
"""
Users can override this to do whatever reset logic they want.
"""
def _logged_process(self, item):
if self._logging == 'input':
self._write_log(item)
self.process(item)
def _begin(self):
try:
self.begin()
except AttributeError:
e = sys.exc_info()[1]
tb = sys.exc_info()[2]
(
code_file, line_no, method_name, line_txt
) = traceback.extract_tb(tb)[-1]
msg = str(e) + (
'\n\nError in .begin() method of \'{}\' node.\n'
'Are you trying to call .push() from inside the\n'
'.begin() method? That is not allowed.\n\n'
'file: {}, line {}\n--> {}\n\n'
).format(self.name, code_file, line_no, line_txt)
traceback.print_exc()
raise AttributeError(msg)
def begin(self):
pass
def end(self):
pass
def _write_log(self, item):
sys.stdout.write('node_log,{},{},{}\n'.format(self._logging, self.name, item))
def _push(self, item):
"""
This is the default pusher. It pushes to all downstreams.
"""
if self._logging == 'output':
self._write_log(item)
# The _process attribute will be set to the appropriate callable
# when initializing the pipeline. I do this because I want the
# chaining to be as efficient as possible. If logging is not set,
# I don't want to have to hit that logic every push, so I just
# invoke a callable attribute at each process that has been set
# to the appropriate callable.
for downstream in self._downstream_nodes:
downstream._process(item)
class _RouterNode(Node):
"""
This node will route to downstreams. The router function needs to
return the name of the destination node.
"""
def __init__(self, name, end_point_map, route_callable):
super(_RouterNode, self).__init__(name)
self._end_point_map = end_point_map
self._pydot_node_kwargs = dict(name=self.name, shape='oval')
self._route_callable = route_callable
def process(self, item):
"""
This is the default pusher. It pushes to all downstreams.
"""
node = self._end_point_map.get(self._route_callable(item), None)
if node is None:
raise ValueError(
(
'\n\nRouter node {} encountered bad route path {}. Valid '
'route paths are {}.'
).format(
self.name,
repr(self._route_callable(item)),
[n.name for n in self._downstream_nodes]
)
)
node._process(item)
class GroupByNode(Node):
def __init__(self, *args, **kwargs):
super(GroupByNode, self).__init__(*args, **kwargs)
self._batch_ = []
self._previous_key = '__no_previous_key__'
def key(self, item):
"""
You must define this method.
:type item: object
:param item: The item you are processing
:rtype: hashable object
:return: a hashable object that serves as a key for the grouping process
"""
raise NotImplementedError(
'you must define a .key(self, item) method on all '
'GroupBy nodes.'
)
def process(self, batch):
"""
You must define this method.
:type batch: iterable
:param batch: A batch of items having the same key
"""
raise NotImplementedError(
'You must define a .process(self, batch) method on all GroupBy '
'nodes.'
)
def _process_item(self, item):
key = self.key(item)
if key != self._previous_key:
self._previous_key = key
if len(self._batch_) > 0:
self.process(self._batch_)
self._batch_ = [item]
else:
self._batch_.append(item)
def _end(self):
self.process(self._batch_)
self._batch_ = []
def __getattribute__(self, name):
"""
This traps calls to the end() method and installs a pre-hook
that flushes the final batch.
"""
if name == 'end':
def wrapper():
self._end()
return super(GroupByNode, self).__getattribute__(name)()
return wrapper
else:
return super(GroupByNode, self).__getattribute__(name)
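The `GroupByNode` class above batches contiguous items that share a key: `_process_item` accumulates items until the key changes, and `_end` flushes the final batch. The same technique can be sketched in isolation. This is a hedged, self-contained illustration (the helper name `batch_by_key` is mine, not part of consecution); like the node, it only groups contiguous runs, so items must arrive sorted by key:

```python
# Minimal sketch of GroupByNode's key-change batching: items are
# accumulated while their key stays the same, and each completed
# batch is emitted when the key changes (plus a final flush).
def batch_by_key(items, key):
    batches = []
    current = []
    previous_key = object()  # sentinel that matches nothing
    for item in items:
        k = key(item)
        if k != previous_key and current:
            # key changed: emit the finished batch
            batches.append(current)
            current = []
        previous_key = k
        current.append(item)
    if current:  # flush the last batch, like GroupByNode._end()
        batches.append(current)
    return batches
```

Note that, as with the real node, non-contiguous occurrences of the same key land in separate batches.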
================================================
FILE: consecution/pipeline.py
================================================
import sys
from consecution.nodes import GroupByNode
class GlobalState(object):
"""
GlobalState is a simple container class that sets its attributes from
constructor kwargs. It supports both object and dictionary access to its
attributes. So, for example, all of the following statements are supported.
.. code-block:: python
from consecution import GlobalState
global_state = GlobalState(a=1, b=2)
global_state['c'] = 2
a = global_state['a']
An object of this class will be created as the default ``.global_state``
attribute on a Pipeline if you do not explicitly provide a global_state
argument to the constructor.
"""
# I'm using unconventional "_item_self_" name here to avoid
# conflicts when kwargs actually contain a "self" arg.
def __init__(_item_self, **kwargs):
for key, val in kwargs.items():
_item_self[key] = val
def __str__(_item_self):
quoted_keys = [
'\'{}\''.format(k) for k in sorted(vars(_item_self).keys())]
att_string = ', '.join(quoted_keys)
return 'GlobalState({})'.format(att_string)
def __repr__(_item_self):
return _item_self.__str__()
def __setitem__(_item_self, key, value):
setattr(_item_self, key, value)
def __getitem__(_item_self, key):
return getattr(_item_self, key)
class Pipeline(object):
"""
:type node: Node
:param node: Any node in a connected graph
:type global_state: object
:param global_state: Any python object you want to use for holding global
state.
Once Nodes have been wired together, they must be placed in a pipeline in
order to process data. If you would like to perform pipeline-level set up and
tear-down logic, you can subclass from Pipeline and override the
``.begin()`` and ``end()`` methods.
"""
def __init__(self, node, global_state=None):
# get a reference to the top node of the connected nodes supplied.
self.top_node = node.top_node
# set the pipeline global state
if global_state:
self.global_state = global_state
else:
self.global_state = GlobalState()
# initialize an empty lookup for nodes
self._node_lookup = {}
# initialize the pipeline
self.initialize()
def initialize(self, with_push=False):
# define a flag indicating whether the pipeline is "running".
# It is True only between the .begin() call and the
# .end() call.
self._is_running = False
self._needs_log_header = False
# initialize each node
for node in self.top_node.all_nodes:
self.initialize_node(node, with_push)
# build the pipeline repr by cycling through all the nodes
self.top_node.top_down_make_repr()
# print a logging header if any node is logging
if self._needs_log_header:
sys.stdout.write('node_log,what,node_name,item\n')
def initialize_node(self, node, with_push=False):
# give node reference to pipeline attributes
node.pipeline = self
node.global_state = self.global_state
# make node available for lookup
self._node_lookup[node.name] = node
# set the _process callable to be either logged or unlogged
# TODO: might want to change this logic so that groupby nodes
# can be logged
if isinstance(node, GroupByNode):
node._process = node._process_item
elif node._logging is None:
node._process = node.process
else:
self._needs_log_header = True
node._process = node._logged_process
# for single downstreams with no logging, can short-circuit all logic
# and directly wire up the downstream process() callable as the
# push callable on this node
short_it = len(node._downstream_nodes) == 1
short_it = short_it and node._downstream_nodes[0]._logging is None
short_it = short_it and not isinstance(
node._downstream_nodes[0], GroupByNode)
# only initialize push if requested
if with_push:
if short_it and node._logging is None:
node.push = node._downstream_nodes[0].process
# logged or multiple downstreams require logic, so no short circuit
else:
node.push = node._push
def __getitem__(self, name):
node = self._node_lookup.get(name, None)
if node is None:
raise KeyError('No node named \'{}\''.format(name))
return node
def __setitem__(self, name_to_replace, replacement_node):
# make sure replacement node has proper name
if name_to_replace != replacement_node.name:
raise ValueError(
'Replacement node must have the same name.'
)
# this will automatically raise error if the name doesn't exist
node_to_replace = self[name_to_replace]
removals = []
additions = []
for upstream in node_to_replace._upstream_nodes:
removals.append((upstream, node_to_replace))
additions.append((upstream, replacement_node))
# handle special case of upstream being a routing node
if hasattr(upstream, '_end_point_map'):
upstream._end_point_map[name_to_replace] = replacement_node
for downstream in node_to_replace._downstream_nodes:
removals.append((node_to_replace, downstream))
additions.append((replacement_node, downstream))
for upstream, downstream in removals:
upstream.remove_downstream(downstream)
for upstream, downstream in additions:
upstream.add_downstream(downstream)
# initialize the replacement node within the pipeline
self.initialize_node(replacement_node)
# if the top node was replaced then make sure the pipeline knows about it
if replacement_node.name == self.top_node.name:
self.top_node = replacement_node
def __getattribute__(self, name):
"""
This should trap for the begin() and end() method calls and install
pre/post hooks for when they are called either on the pipeline
class or on any class derived from it.
"""
if name == 'begin':
def wrapper():
super(Pipeline, self).__getattribute__(name)()
self._begin()
return wrapper
elif name == 'end':
def wrapper():
self._end()
return super(Pipeline, self).__getattribute__(name)()
return wrapper
elif name == 'reset':
def wrapper():
self._reset()
return super(Pipeline, self).__getattribute__(name)()
return wrapper
else:
return super(Pipeline, self).__getattribute__(name)
def begin(self):
"""
Override this method to execute any logic you want to perform before
setting up nodes. The ``.begin()`` method of all nodes will be called.
"""
def end(self):
"""
Override this method to execute any logic you want to perform after
all nodes are done processing data. The ``.end()`` method of all nodes
will be called.
"""
def reset(self):
"""
Override this with any logic you'd like to perform for resetting the
pipeline. The ``.reset()`` method of all nodes will be called.
"""
def _reset(self):
self.top_node.top_down_call('reset')
def _begin(self):
self.top_node.top_down_call('_begin')
self.initialize(with_push=True)
self._is_running = True
def _end(self):
self.top_node.top_down_call('end')
self._is_running = False
def push(self, item):
"""
You can manually push items to your pipeline using this method.
:type item: object
:param item: Any object you would like the pipeline to process
"""
if not self._is_running:
self.begin()
self.top_node._process(item)
def consume(self, iterable):
"""
The pipeline will process each item in the iterable.
:type iterable: A Python Iterable
:param iterable: An iterable of objects you would like to process
"""
self.begin()
for item in iterable:
self.top_node._process(item)
return self.end()
def plot(self, file_name='pipeline', kind='png'):
"""
Call this method to produce a visualization of your pipeline. The
Graphviz library will be used to generate the image file. Note that
pipelines are automatically visualized in a Jupyter notebook when they
are evaluated as the last expression in a cell.
:type file_name: str
:param file_name: The name of the image file to save
:type kind: str
:param kind: The type of image file to produce (png, pdf)
"""
self.top_node.plot(file_name, kind)
return self
def __str__(self):
return (
'\nPipeline\n'
'----------------------------------'
'----------------------------------\n{}'
'----------------------------------'
'----------------------------------\n'
).format(self._node_repr)
def __repr__(self):
return self.__str__()
# No good way to test this unless you know dot is installed.
def _repr_svg_(self): # pragma: no cover
return self.top_node._build_pydot_graph()._repr_svg_()
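`Pipeline.consume` above drives the top node's `_process` callable once per item, and `Node._push` fans results out to downstream nodes. A minimal, self-contained sketch of that flow, assuming each node applies a transform and forwards the result (the names `MiniNode` and `MiniPipeline` are illustrative, not consecution API):

```python
# Sketch of the push/consume flow: the pipeline calls the top node's
# _process for each item, and each node forwards its output to all of
# its downstream nodes, mirroring Node._push.
class MiniNode:
    def __init__(self, name, transform):
        self.name = name
        self.transform = transform
        self.downstreams = []
        self.results = []

    def _process(self, item):
        out = self.transform(item)
        self.results.append(out)
        for node in self.downstreams:  # fan out to all downstreams
            node._process(out)

class MiniPipeline:
    def __init__(self, top_node):
        self.top_node = top_node

    def consume(self, iterable):
        for item in iterable:  # mirrors Pipeline.consume
            self.top_node._process(item)

top = MiniNode('double', lambda x: 2 * x)
tail = MiniNode('inc', lambda x: x + 1)
top.downstreams.append(tail)
MiniPipeline(top).consume([1, 2, 3])
```

The real pipeline adds begin/end hooks, logging, and the short-circuit wiring seen in `initialize_node`; this sketch shows only the core item flow.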
================================================
FILE: consecution/tests/__init__.py
================================================
================================================
FILE: consecution/tests/nodes_tests.py
================================================
import os
from collections import namedtuple
import shutil
import tempfile
from unittest import TestCase
import subprocess
from mock import patch
from consecution.nodes import Node
def dot_installed():
p = subprocess.Popen(
['bash', '-c', 'which dot'], stdout=subprocess.PIPE)
p.wait()
result = p.stdout.read().decode("utf-8")
return 'dot' in result
class FakeDigraph(object): # pragma: no cover
def __init__(self, *args, **kwargs):
pass
def node(self, *args, **kwargs):
pass
def edge(self, *args, **kwargs):
pass
def render(self, *args, **kwargs):
raise RuntimeError('fake runtime error')
class NodeUnitTests(TestCase):
def test_bad_logging_args(self):
n = Node('a')
with self.assertRaises(ValueError):
n.log('bad')
def test_bad_top_down_make_repr_call(self):
n = Node('a')
with self.assertRaises(ValueError):
n.top_down_make_repr()
def test_args_as_atts(self):
n = Node('my_node', silly_attribute='silly')
self.assertEqual(n.silly_attribute, 'silly')
def test_comparisons(self):
a = Node('a')
b = Node('b')
self.assertTrue(a == a)
self.assertFalse(a == b)
self.assertTrue(a < b)
self.assertFalse(b < a)
def test_bad_flattening(self):
a = Node('a')
with self.assertRaises(ValueError):
a | 7
@patch(
'consecution.nodes.Node._build_pydot_graph', lambda a: FakeDigraph())
def test_graphviz_not_installed(self):
a = Node('a')
b = Node('b')
p = a | b
with self.assertRaises(RuntimeError):
p.plot()
def test_no_getitem(self):
a = Node('a')
with self.assertRaises(ValueError):
a['b']
def test_bad_slot_name(self):
a = Node('a')
b = Node('b')
with self.assertRaises(ValueError):
a._get_exposed_slots(b, 'bad_arg')
class ExplicitWiringTests(TestCase):
def setUp(self):
self.temp_dir = tempfile.mkdtemp()
def tearDown(self):
shutil.rmtree(self.temp_dir)
def do_wiring(self):
self.do_explicit_wiring()
def do_explicit_wiring(self):
# define nodes
a = Node('a')
b = Node('b')
c = Node('c')
d = Node('d')
e = Node('e')
f = Node('f')
g = Node('g')
h = Node('h')
i = Node('i')
j = Node('j')
k = Node('k')
l = Node('l') # noqa. okay to use l as var here
m = Node('m')
n = Node('n')
# save a list of all nodes
self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n]
self.top_node = a
# wire up the nodes
a.add_downstream(b)
a.add_downstream(c)
c.add_downstream(d)
c.add_downstream(e)
e.add_downstream(f)
e.add_downstream(g)
e.add_downstream(h)
e.add_downstream(i)
f.add_downstream(j)
g.add_downstream(j)
h.add_downstream(j)
i.add_downstream(j)
d.add_downstream(k)
j.add_downstream(k)
b.add_downstream(l)
k.add_downstream(l)
l.add_downstream(m)
l.add_downstream(n)
# same network in graph notation
# a | [
# b,
# c | [
# d,
# e | [f, g, h, i, my_router] | j
# ] | k
# ] | l [m, n]
def do_graph_wiring(self):
# define nodes
a = Node('a')
b = Node('b')
c = Node('c')
d = Node('d')
e = Node('e')
f = Node('f')
g = Node('g')
h = Node('h')
i = Node('i')
j = Node('j')
k = Node('k')
l = Node('l') # noqa. okay to use l as var here
m = Node('m')
n = Node('n')
# save a list of all nodes
self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n]
self.top_node = a
a | [ # noqa
b,
c | [
d,
e | [f, g, h, i] | j
] | k
] | l | [m, n]
def test_connections(self):
Conns = namedtuple('Conns', 'node upstreams downstreams')
self.do_wiring()
n = {
node.name: Conns(
node.name,
{u.name for u in node._upstream_nodes},
{d.name for d in node._downstream_nodes}
)
for node in self.node_list
}
self.assertEqual(n['a'].upstreams, set())
self.assertEqual(n['a'].downstreams, {'b', 'c'})
self.assertEqual(n['b'].upstreams, {'a'})
self.assertEqual(n['b'].downstreams, {'l'})
self.assertEqual(n['c'].upstreams, {'a'})
self.assertEqual(n['c'].downstreams, {'d', 'e'})
self.assertEqual(n['e'].upstreams, {'c'})
self.assertEqual(n['e'].downstreams, {'f', 'g', 'h', 'i'})
self.assertEqual(n['f'].upstreams, {'e'})
self.assertEqual(n['f'].downstreams, {'j'})
self.assertEqual(n['g'].upstreams, {'e'})
self.assertEqual(n['g'].downstreams, {'j'})
self.assertEqual(n['h'].upstreams, {'e'})
self.assertEqual(n['h'].downstreams, {'j'})
self.assertEqual(n['i'].upstreams, {'e'})
self.assertEqual(n['i'].downstreams, {'j'})
self.assertEqual(n['d'].upstreams, {'c'})
self.assertEqual(n['d'].downstreams, {'k'})
self.assertEqual(n['j'].upstreams, {'f', 'g', 'h', 'i'})
self.assertEqual(n['j'].downstreams, {'k'})
self.assertEqual(n['k'].upstreams, {'j', 'd'})
self.assertEqual(n['k'].downstreams, {'l'})
self.assertEqual(n['l'].upstreams, {'k', 'b'})
self.assertEqual(n['l'].downstreams, {'m', 'n'})
def test_all_nodes(self):
self.do_wiring()
expected_set = set(self.node_list)
all_nodes_set = [
set(node.all_nodes) for node in self.node_list
]
self.assertTrue(all(
[expected_set == found_set for found_set in all_nodes_set]))
def test_top_node(self):
self.do_wiring()
top_node_set = {node.top_node for node in self.node_list}
self.assertEqual(top_node_set, {self.top_node})
def test_duplicate_node(self):
self.do_wiring()
# this test is funky in that it has assertion in a loop.
# but I wanted to be sure dups are detected everywhere
for name in [n.name for n in self.top_node.all_nodes]:
dup = Node(name)
with self.assertRaises(ValueError):
self.top_node.add_downstream(dup)
def test_acyclic(self):
self.do_wiring()
# this test is funky in that it has assertion in a loop.
# but I wanted to be sure cycles are detected everywhere
for node in self.top_node.all_nodes:
with self.assertRaises(ValueError):
node.add_downstream(self.top_node)
def test_multi_root(self):
self.do_wiring()
other_root = Node('dual_root')
other_root.add_downstream(self.top_node._downstream_nodes[0])
with self.assertRaises(ValueError):
other_root.top_node
def test_non_node_connect(self):
node = Node('a')
other = 'not a node'
with self.assertRaises(ValueError):
node.add_downstream(other)
def test_write(self):
# don't run coverage on this because travis won't be tested with
# both dot installed and not installed.
if dot_installed(): # pragma: no cover
self.do_wiring()
out_file = os.path.join(self.temp_dir, 'out.png')
self.top_node.plot(out_file)
# copy the image to /tmp in case you want to look at the graph
os.system('cp {} /tmp'.format(out_file))
def test_write_bad_kind(self):
self.do_wiring()
with self.assertRaises(ValueError):
self.top_node.plot(kind='bad')
def test_bad_search_direction(self):
self.do_wiring()
with self.assertRaises(ValueError):
self.top_node.breadth_first_walk(direction='bad')
def test_bad_search_method(self):
self.do_wiring()
with self.assertRaises(ValueError):
self.top_node.walk(how='bad')
class DSLWiringTests(ExplicitWiringTests):
def do_wiring(self):
self.do_graph_wiring()
class TopDownCallTests(TestCase):
def test_call_order_okay(self):
# a toy class that holds a class variable
# tracking what order objects get called in
class MyNode(Node):
call_list = []
def end(self):
self.__class__.call_list.append(self)
a = MyNode('a')
b = MyNode('b')
c = MyNode('c')
d = MyNode('d')
e = MyNode('e')
f = MyNode('f')
g = MyNode('g')
a | [
b | c,
d | e | f
] | g
a.top_node.top_down_call('end')
# make a dictionary with order in which nodes
# were called
call_number = {
node: ind for (ind, node) in enumerate(a.__class__.call_list)}
# make sure ordering of one branch is right
self.assertTrue(call_number[a] < call_number[b])
self.assertTrue(call_number[b] < call_number[c])
self.assertTrue(call_number[c] < call_number[g])
# make sure ordering of other branch is okay
self.assertTrue(call_number[a] < call_number[d])
self.assertTrue(call_number[d] < call_number[e])
self.assertTrue(call_number[e] < call_number[f])
self.assertTrue(call_number[f] < call_number[g])
class BreadthFirstSearchTests(TestCase):
def test_top_down_order(self):
a = Node('a')
b = Node('b')
c = Node('c')
d = Node('d')
e = Node('e')
f = Node('f')
h = Node('h')
i = Node('i')
def silly_router(item): # pragma: no cover
return 0
a | [b, c] | [d, e, f, silly_router] | [h, i]
nodes = a.top_node.breadth_first_walk(
direction='down', as_ordered_list=True)
level5 = {nodes.pop() for nn in range(2)}
level4 = {nodes.pop() for nn in range(3)}
level3 = {nodes.pop() for nn in range(2)}
level2 = {nodes.pop() for nn in range(2)}
level1 = {nodes.pop() for nn in range(1)}
self.assertEqual(level1, {a})
self.assertEqual(level2, {b, c})
self.assertEqual(len(level3), 2)
self.assertEqual(level4, {d, e, f})
self.assertEqual(level5, {h, i})
def test_bottom_up_order(self):
a = Node('a')
b = Node('b')
c = Node('c')
d = Node('d')
e = Node('e')
f = Node('f')
h = Node('h')
def silly_router(item): # pragma: no cover
return 0
a | [b, c] | [d, e, f, silly_router] | h
nodes = h.breadth_first_walk(direction='up', as_ordered_list=True)
nodes = nodes[::-1]
level5 = {nodes.pop() for nn in range(1)}
level4 = {nodes.pop() for nn in range(3)}
level3 = {nodes.pop() for nn in range(2)}
level2 = {nodes.pop() for nn in range(2)}
level1 = {nodes.pop() for nn in range(1)}
self.assertEqual(level1, {a})
self.assertEqual(level2, {b, c})
self.assertEqual(len(level3), 2)
self.assertEqual(level4, {d, e, f})
self.assertEqual(level5, {h})
class PrintingTests(TestCase):
def setUp(self):
# define nodes
a = Node('a')
b = Node('b')
c = Node('c')
d = Node('d')
e = Node('e')
f = Node('f')
g = Node('g')
h = Node('h')
i = Node('i')
j = Node('j')
k = Node('k')
l = Node('l') # noqa okay to use l here
m = Node('m')
n = Node('n')
class DummyPipeline(object):
pass
pipeline = DummyPipeline()
# save a list of all nodes
self.node_list = [a, b, c, d, e, f, g, h, i, j, k, l, m, n]
self.top_node = a
def my_router(item): # pragma: no cover
return 'm'
# wire up nodes using dsl
a | [
b, # noqa
c | [
d,
e | [f, g, h, i] | j
] | k
] | l | [m, n, my_router]
for node in self.top_node.all_nodes:
node.pipeline = pipeline
def test_nothing(self):
self.top_node.top_down_make_repr()
lines = sorted([
line.strip()
for line in self.top_node.pipeline._node_repr.split('\n')
if line.strip()
])
expected_lines = sorted([
'a | [b, c]',
'b | l',
'c | [d, e]',
'd | k',
'e | [f, g, h, i]',
'f | j',
'g | j',
'h | j',
'i | j',
'j | k',
'k | l',
'l | l.my_router',
'l.my_router | [m, n]',
])
self.assertEqual(lines, expected_lines)
class RoutingTests(TestCase):
def test_nothing(self):
a = Node('a')
b = Node('b')
c = Node('c')
d = Node('d')
e = Node('e')
def silly_router(item): # pragma: no cover
return 0
class ClassRouter(object): # pragma: no cover
def __call__(self, arg):
return arg
a | [b, c, ClassRouter()] | [d, e, silly_router]
================================================
FILE: consecution/tests/pipeline_tests.py
================================================
from __future__ import print_function
from collections import Counter
from unittest import TestCase
from consecution.nodes import Node, GroupByNode
from consecution.pipeline import Pipeline, GlobalState
from consecution.tests.testing_helpers import print_catcher
class Item(object): # pragma: no cover (just a testing helper)
def __init__(self, value, parent, source):
self.value = value
self.parent = parent
self.source = source
def build_source_list(self, source_list=None):
source_list = [] if source_list is None else source_list
source_list.append(self.source)
if self.parent:
self.parent.build_source_list(source_list)
return source_list
def get_path_string(self):
return '|'.join([str(self.value)] + self.build_source_list()[::-1])
def __str__(self):
return self.get_path_string()
def __repr__(self):
return self.get_path_string()
class TestNode(Node):
def process(self, item):
self.push(
Item(value=item.value, parent=item, source=self.name)
)
class ResultNode(Node):
def process(self, item):
self.global_state.final_items.append(item)
class BadNode(Node):
def begin(self):
self.push(1)
def process(self, item): # pragma: no cover this should never get hit.
self.push(item)
def item_generator():
for ind in range(1, 3):
yield Item(
value=ind,
parent=None,
source='generator'
)
class TestBase(TestCase):
def setUp(self):
a = TestNode('a')
b = TestNode('b')
c = TestNode('c')
d = TestNode('d')
even = TestNode('even')
odd = TestNode('odd')
g = TestNode('g')
def even_odd(item):
return ['even', 'odd'][item.value % 2]
a | b | [c, d] | [even, odd, even_odd] | g
self.pipeline = Pipeline(a, global_state=GlobalState(final_items=[]))
class GlobalStateUnitTests(TestCase):
def test_kwargs_passed(self):
g = GlobalState(custom_name='custom')
p = Pipeline(TestNode('a'), global_state=g)
self.assertTrue(p.global_state.custom_name == 'custom')
self.assertTrue(p.global_state['custom_name'] == 'custom')
def test_printing(self):
g = GlobalState(custom_name='custom')
with print_catcher() as catcher1:
print(g)
with print_catcher() as catcher2:
print(repr(g))
self.assertTrue(
'GlobalState(\'custom_name\')' in catcher1.txt)
self.assertTrue(
'GlobalState(\'custom_name\')' in catcher2.txt)
class OrOpTests(TestCase):
def test_ror(self):
a = Node('a')
b = Node('b')
c = Node('c')
d = Node('d')
p = Pipeline(a | ([b, c] | d))
with print_catcher() as catcher:
print(p)
self.assertTrue('a | [b, c]' in catcher.txt)
self.assertTrue('c | d' in catcher.txt)
self.assertTrue('b | d' in catcher.txt)
class ManualFeedTests(TestCase):
def test_manual_feed(self):
class N(Node):
def begin(self):
self.global_state.out_list = []
def process(self, item):
self.global_state.out_list.append(item)
pipeline = Pipeline(TestNode('a') | N('b'))
pushed_list = []
for item in item_generator():
pushed_list.append(item)
pipeline.push(item)
pipeline.end()
self.assertEqual(len(pipeline.global_state.out_list), 2)
class PipelineUnitTests(TestCase):
def test_push_in_begin(self):
pipeline = Pipeline(BadNode('a') | TestNode('b'))
with self.assertRaises(AttributeError):
pipeline.begin()
def test_no_process(self):
class N(Node):
pass
pipe = Pipeline(N('a') | N('b'))
with self.assertRaises(NotImplementedError):
pipe.consume(range(3))
def test_bad_route(self):
def bad_router(item):
return 'bad'
class N(Node):
def process(self, item):
self.push(item)
pipeline = Pipeline(N('a') | [N('b'), N('c'), bad_router])
with self.assertRaises(ValueError):
pipeline.consume(range(3))
def test_bad_node_lookup(self):
pipeline = Pipeline(TestNode('a') | TestNode('b'))
with self.assertRaises(KeyError):
pipeline['c']
def test_bad_replacement_name(self):
pipeline = Pipeline(TestNode('a') | TestNode('b'))
with self.assertRaises(ValueError):
pipeline['b'] = TestNode('c')
def test_flattened_list(self):
pipeline = Pipeline(
TestNode('a') | [[Node('b'), Node('c')]])
with print_catcher() as catcher:
print(pipeline)
self.assertTrue('a | [b, c]' in catcher.txt)
def test_logging(self):
pipeline = Pipeline(TestNode('a') | TestNode('b'))
pipeline['a'].log('output')
pipeline['b'].log('input')
with print_catcher() as catcher:
pipeline.consume(item_generator())
text = """
node_log,what,node_name,item
node_log,output,a,1|generator|a
node_log,input,b,1|generator|a
node_log,output,a,2|generator|a
node_log,input,b,2|generator|a
"""
for line in text.split('\n'):
self.assertTrue(line.strip() in catcher.txt)
def test_reset(self):
class N(Node):
def begin(self):
self.was_reset = False
def process(self, item):
self.push(item)
def reset(self):
self.was_reset = True
pipe = Pipeline(N('a') | N('b'))
pipe.consume(range(3))
self.assertFalse(pipe['a'].was_reset)
self.assertFalse(pipe['b'].was_reset)
pipe.reset()
self.assertTrue(pipe['a'].was_reset)
self.assertTrue(pipe['b'].was_reset)
class LoggingTests(TestBase):
def test_logging(self):
self.pipeline['g'].log('input')
with print_catcher() as printer:
self.pipeline.consume(item_generator())
counter = Counter()
for line in printer.lines():
even_odd = line.split('|')[-1]
counter.update({even_odd: 1})
self.assertEqual(counter['even'], 2)
self.assertEqual(counter['odd'], 2)
class ReplacementTests(TestBase):
def test_replace_first(self):
class Replacement(Node):
def process(self, item):
self.push(
Item(value=10 * item.value, parent=item, source=self.name)
)
self.pipeline['a'] = Replacement('a')
self.pipeline['a'].log('output')
with print_catcher() as printer:
self.pipeline.consume(item_generator())
self.assertEqual(printer.txt.count('10'), 1)
self.assertEqual(printer.txt.count('20'), 1)
def test_replace_even(self):
class Replacement(Node):
def process(self, item):
self.push(
Item(value=10 * item.value, parent=item, source=self.name)
)
self.pipeline['even'] = Replacement('even')
self.pipeline['g'].log('output')
with print_catcher() as printer:
self.pipeline.consume(item_generator())
self.assertEqual(printer.txt.count('1'), 2)
self.assertEqual(printer.txt.count('20'), 2)
def test_replace_no_router(self):
a = TestNode('a')
b = TestNode('b')
pipe = Pipeline(a | b)
pipe['b'] = TestNode('b')
with print_catcher() as catcher:
print(pipe)
self.assertTrue('a | b' in catcher.txt)
class ConsumingTests(TestBase):
def test_even_odd(self):
self.pipeline['g'].add_downstream(
ResultNode('result_node')
)
self.pipeline.consume(item_generator())
expected_path_set = set([
'1|generator|a|b|c|odd|g',
'1|generator|a|b|d|odd|g',
'2|generator|a|b|c|even|g',
'2|generator|a|b|d|even|g',
])
path_set = set(
item.get_path_string() for item in
self.pipeline.global_state.final_items
)
self.assertEqual(expected_path_set, path_set)
class ConstructingTests(TestBase):
def test_printing(self):
lines = repr(self.pipeline).split('\n')
self.assertEqual(len(lines), 13)
def test_plotting(self):
# don't want to force a mock dependency, so make a simple mock here
args_kwargs = []
def return_calls(*args, **kwargs):
args_kwargs.append(args)
args_kwargs.append(kwargs)
# assign my mock to the top node plot function
self.pipeline.top_node.plot = return_calls
# call pipeline plot
self.pipeline.plot()
# make sure top node plot was properly called
self.assertEqual(args_kwargs[0], ('pipeline', 'png'))
self.assertEqual(args_kwargs[1], {})
class Batch(GroupByNode):
def begin(self):
self.global_state.batches = []
def key(self, item):
return item // 3
def process(self, batch):
self.global_state.batches.append(batch)
class GroupByTests(TestCase):
def test_batching(self):
pipe = Pipeline(Batch('a'))
pipe.consume(range(9))
self.assertEqual(
pipe.global_state.batches,
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
)
def test_undefined_key(self):
class B(GroupByNode):
def process(self, item): # pragma: no cover
pass
pipe = Pipeline(B('a'))
with self.assertRaises(NotImplementedError):
pipe.consume(range(9))
def test_undefined_process(self):
class B(GroupByNode):
def key(self, item):
pass
pipe = Pipeline(B('a'))
with self.assertRaises(NotImplementedError):
pipe.consume(range(9))
================================================
FILE: consecution/tests/testing_helpers.py
================================================
import sys
from contextlib import contextmanager
# These don't need to be covered. They are just testing utilities
@contextmanager
def print_catcher(buff='stdout'): # pragma: no cover
if buff == 'stdout':
sys.stdout = Printer()
yield sys.stdout
sys.stdout = sys.__stdout__
elif buff == 'stderr':
sys.stderr = Printer()
yield sys.stderr
sys.stderr = sys.__stderr__
else: # pragma: no cover This is just to help testing. No need to cover.
raise ValueError('buff must be either \'stdout\' or \'stderr\'')
class Printer(object): # pragma: no cover
def __init__(self):
self.txt = ""
def write(self, txt):
self.txt += txt
def lines(self):
for line in self.txt.split('\n'):
yield line.strip()
================================================
FILE: consecution/tests/utils_tests.py
================================================
from __future__ import print_function
from unittest import TestCase
from consecution.utils import Clock
import time
from consecution.tests.testing_helpers import print_catcher
class ClockTests(TestCase):
def test_bad_start(self):
clock = Clock()
with self.assertRaises(ValueError):
clock.start()
def test_printing(self):
clock = Clock()
with clock.running('a', 'b', 'c'):
with clock.paused('a'):
time.sleep(.1)
with clock.paused('b'):
time.sleep(.1)
with print_catcher() as printer:
print(repr(clock))
names = []
for ind, line in enumerate(printer.txt.split('\n')):
if line:
if ind > 0:
names.append(line.split()[-1])
self.assertEqual(names, ['c', 'b', 'a'])
def test_get_time_of_running(self):
clock = Clock()
with clock.running('a'):
time.sleep(.1)
delta1 = int(10 * clock.get_time())
time.sleep(.1)
delta2 = int(10 * clock.get_time())
self.assertEqual(delta1, 1)
self.assertEqual(delta2, 2)
def test_pausing(self):
clock = Clock()
with clock.running('a', 'b', 'c'):
time.sleep(.1)
with clock.paused('b', 'c'):
time.sleep(.1)
self.assertEqual(int(10 * clock.get_time('a')), 2)
self.assertEqual(int(10 * clock.get_time('b')), 1)
self.assertEqual(int(10 * clock.get_time('c')), 1)
self.assertEqual(
{int(10 * v) for v in clock.get_time().values()},
{1, 2}
)
def test_stop_all(self):
clock = Clock()
clock.start('a', 'b')
time.sleep(.1)
clock.stop()
self.assertEqual(int(10 * clock.get_time('a')), 1)
self.assertEqual(int(10 * clock.get_time('b')), 1)
def test_reset_all(self):
clock = Clock()
clock.start('a', 'b')
time.sleep(.1)
clock.stop('b')
self.assertEqual(len(clock.delta), 1)
clock.reset()
self.assertEqual(len(clock.get_time()), 0)
def test_double_calls(self):
clock = Clock()
clock.start('a')
clock.start('a')
time.sleep(.1)
clock.stop('a')
clock.stop('a')
self.assertEqual(int(round(10 * clock.get_time())), 1)
clock.reset('a')
clock.reset('a')
clock.reset('b')
clock.reset('b')
self.assertEqual(clock.get_time(), {})
def test_get_time_delta_only(self):
clock = Clock()
clock.start('a')
clock.stop('a')
self.assertEqual(clock.get_time('f'), {})
================================================
FILE: consecution/utils.py
================================================
from collections import Counter
from contextlib import contextmanager
import datetime
class Clock(object):
def __init__(self):
# see the reset method for instance attributes
self.delta = Counter()
self.active_start_times = dict()
@contextmanager
def running(self, *names):
self.start(*names)
yield
self.stop(*names)
@contextmanager
def paused(self, *names):
self.stop(*names)
yield
self.start(*names)
def start(self, *names):
if not names:
raise ValueError('You must provide at least one name to start')
for name in names:
if name not in self.active_start_times:
self.active_start_times[name] = datetime.datetime.now()
def stop(self, *names):
ending = datetime.datetime.now()
if not names:
names = list(self.active_start_times.keys())
for name in names:
if name in self.active_start_times:
starting = self.active_start_times.pop(name)
self.delta.update({name: (ending - starting).total_seconds()})
def reset(self, *names):
if not names:
names = list(self.active_start_times.keys())
names.extend(list(self.delta.keys()))
for name in names:
if name in self.delta:
self.delta.pop(name)
if name in self.active_start_times:
self.active_start_times.pop(name)
def get_time(self, *names):
ending = datetime.datetime.now()
if not names:
names = list(self.delta.keys())
names.extend(list(self.active_start_times.keys()))
delta = Counter()
for name in names:
if name in self.delta:
delta.update({name: self.delta[name]})
elif name in self.active_start_times:
delta.update(
{
name: (
ending - self.active_start_times[name]
).total_seconds()
}
)
if len(delta) == 1:
return delta[list(delta.keys())[0]]
else:
return dict(delta)
def __str__(self):
records = sorted(self.delta.items(), key=lambda t: t[1], reverse=True)
records = [('%0.6f' % r[1], r[0]) for r in records]
out_list = ['{: <15s}{}'.format('seconds', 'name')]
for rec in records:
out_list.append('{: <15s}{}'.format(*rec))
return '\n'.join(out_list)
def __repr__(self):
return self.__str__()
================================================
FILE: docker/Dockerfile
================================================
FROM ubuntu:xenial
# root is the home directory
WORKDIR /root
ADD simple_example.py /root/simple_example.py
# set up the system tools including conda
RUN \
rm /bin/sh && ln -s /bin/bash /bin/sh && \
apt-get update && \
apt-get install -y vim && \
apt-get install -y git && \
apt-get install -y wget && \
apt-get install -y curl && \
apt-get install -y graphviz && \
apt-get install -y python-dev
RUN \
curl -sS https://bootstrap.pypa.io/get-pip.py | python
RUN \
pip install git+https://github.com/robdmc/consecution.git
================================================
FILE: docker/docker_build.sh
================================================
#! /usr/bin/env bash
docker build . -t consecution
================================================
FILE: docker/docker_run.sh
================================================
#! /usr/bin/env bash
docker run -it --rm -v $(pwd):/root/shared consecution /bin/bash
================================================
FILE: docker/simple_example.py
================================================
#! /usr/bin/env python
# TODO: make the consecution install in the docker file read from pip
from __future__ import print_function
from consecution import Node, Pipeline
class N(Node):
def process(self, item):
print(item, self.name)
self.push(item)
p = Pipeline(
N('a') | [N('b'), N('c')] | N('d')
)
p.plot()
p.consume(range(5))
================================================
FILE: docs/Makefile
================================================
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp epub latex latexpdf text man changes linkcheck doctest gettext
help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " epub to make an epub"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " xml to make Docutils-native XML files"
@echo " pseudoxml to make pseudoxml-XML files for display purposes"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
clean:
rm -rf $(BUILDDIR)/*
html:
$(SPHINXBUILD) -W -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
latexpdfja:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through platex and dvipdfmx..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."
info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
xml:
$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
@echo
@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
================================================
FILE: docs/conf.py
================================================
# -*- coding: utf-8 -*-
#
import inspect
import os
import re
def get_version():
"""Obtain the packge version from a python file e.g. pkg/__init__.py
See <https://packaging.python.org/en/latest/single_source_version.html>.
"""
file_dir = os.path.realpath(os.path.dirname(__file__))
with open(
os.path.join(file_dir, '..', 'consecution', '__init__.py')) as f:
txt = f.read()
version_match = re.search(
r"""^__version__ = ['"]([^'"]*)['"]""", txt, re.M)
if version_match:
return version_match.group(1)
raise RuntimeError("Unable to find version string.")
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))
# -- General configuration ------------------------------------------------
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.intersphinx',
#'sphinx.ext.viewcode',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The master toctree document.
master_doc = 'toc'
# General information about the project.
project = 'consecution'
copyright = '2017, Rob deCarvalho'
# The short X.Y version.
version = get_version()
# The full version, including alpha/beta/rc tags.
release = version
exclude_patterns = ['_build']
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
intersphinx_mapping = {
'python': ('http://docs.python.org/3.4', None),
'django': ('http://django.readthedocs.org/en/latest/', None),
#'celery': ('http://celery.readthedocs.org/en/latest/', None),
}
# -- Options for HTML output ----------------------------------------------
html_theme = 'default'
#html_theme_path = []
on_rtd = os.environ.get('READTHEDOCS', None) == 'True'
if not on_rtd: # only import and set the theme if we're building docs locally
import sphinx_rtd_theme
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = ['_static']
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
html_show_sphinx = False
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
html_show_copyright = True
# Output file base name for HTML help builder.
htmlhelp_basename = 'consecutiondoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'consecution.tex', 'consecution Documentation',
'Rob deCarvalho', 'manual'),
]
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'consecution', 'consecution Documentation',
['Rob deCarvalho'], 1)
]
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'consecution', 'consecution Documentation',
'Rob deCarvalho', 'consecution', 'A short description',
'Miscellaneous'),
]
def process_django_model_docstring(app, what, name, obj, options, lines):
"""
Does special processing for django model docstrings, making docs for
fields in the model.
"""
# These cause import errors if left outside the function
from django.db import models
from django.utils.html import strip_tags
from django.utils.encoding import force_unicode
# Only look at objects that inherit from Django's base model class
if inspect.isclass(obj) and issubclass(obj, models.Model):
# Grab the field list from the meta class
fields = obj._meta.fields
for field in fields:
# Decode and strip any html out of the field's help text
help_text = strip_tags(force_unicode(field.help_text))
# Decode and capitalize the verbose name, for use if there isn't
# any help text
verbose_name = force_unicode(field.verbose_name).capitalize()
if help_text:
# Add the model field to the end of the docstring as a param
# using the help text as the description
lines.append(':param %s: %s' % (field.attname, help_text))
else:
# Add the model field to the end of the docstring as a param
# using the verbose name as the description
lines.append(':param %s: %s' % (field.attname, verbose_name))
# Add the field's type to the docstring
lines.append(':type %s: %s' % (field.attname, type(field).__name__))
# Return the extended docstring
return lines
def setup(app):
# Register the docstring processor with sphinx
app.connect('autodoc-process-docstring', process_django_model_docstring)
================================================
FILE: docs/index.rst
================================================
Overview
=============================
Consecution is:
* An easy-to-use pipeline abstraction inspired by
`Apache Storm Topologies <http://storm.apache.org/releases/current/Tutorial.html>`_.
* Designed to simplify building ETL pipelines that are robust and easy to test
* A system for wiring together simple processing nodes to form a DAG, which is fed with a Python iterable
* Built using synchronous, single-threaded execution strategies designed to run efficiently on a single core
* Implemented in pure Python with optional requirements that are needed only for graph visualization
* Written with 100% test coverage
See the
`Github project page <https://github.com/robdmc/consecution>`_.
for examples of how to use `consecution`.
================================================
FILE: docs/ref/consecution.rst
================================================
.. _ref-consecution:
API documentation
==================
Node
----
Nodes are the fundamental processing unit in consecution. A node is created by
inheriting from the `consecution.Node` class. You are free to declare as many
attributes and methods on a node class as you wish. You should not override the
constructor unless you really know what you're doing. Instead, any
initialization you wish to perform can be carried out in the `.begin()` method.
In the descriptions below, it is assumed that the nodes being discussed have
been wired together into a pipeline and are ready to consume items.
See the
`Github README
<https://github.com/robdmc/consecution/blob/master/README.md>`_
for examples of how to wire nodes into pipelines.
Reserved Method Names
~~~~~~~~~~~~~~~~~~~~~
The following Node methods are not intended to be overridden, so you should not
define methods with these names in your node implementations unless you really
know what you are doing.
* `top_node`
* `initial_node_set`
* `terminal_node_set`
* `root_nodes`
* `all_nodes`
* `log`
* `top_down_make_repr`
* `top_down_call`
* `depth_first_search`
* `breadth_first_search`
* `search`
* `add_downstream`
* `remove_downstream`
* `plot`
There are also a number of private method names you should avoid. These can be
identified by looking at the `source code
<https://github.com/robdmc/consecution/blob/master/consecution/nodes.py>`_
Examples
~~~~~~~~
Here is the simplest possible node you could construct:
.. code-block:: python
from consecution import Node
class MyNode(Node):
def process(self, item):
self.push(item)
All nodes acquire a `.push()` method when they are wired into a pipeline. You
can call this method anywhere in your class except in the `.begin()` method.
The `.push(item)` method will take its argument and send it to the `.process()`
methods of the nodes that are immediately downstream in your pipeline graph.
Here is an example node defining all methods you can override. The
functionality of each method is explained in the code comments.
.. code-block:: python
from consecution import Node
class MyNode(Node):
def begin(self):
# This sets up whatever state you want to exist before the
# node begins processing any data. You can think of it as an
# init method that runs just before the node starts processing.
# In this example, we initialize a simple counter
self.counter = 0
def process(self, item):
# This is the method that defines the processing you want to perform
# on every item the node processes. You can place whatever logic
# you want here, including calls to the .push() method.
# In this example, we update the counter and push the item
# downstream.
self.counter += 1
self.push(item)
def end(self):
# This method is called right after all items are processed.
# This happens when the iterator being consumed by the pipeline
# is exhausted. At that point the .end() methods of all nodes
# in the pipeline are called. This is a good place for you to
# push any summary information downstream.
# In this example we push the results of our counter
self.push(self.counter)
def reset(self):
# A pipeline can be reused and reset back to its initial condition.
# It does this by calling the .reset() method of all its member
# nodes. You can place whatever code you want here to reset your
# node to its initial state.
# In this example, we simply reset the counter.
self.counter = 0
Node API Documentation
~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: consecution.nodes.Node
:members:
GroupBy Node
~~~~~~~~~~~~~~~~~~~~~~
Consecution provides a special Node class specifically designed to do grouping.
It works in much the same way as Python's built-in
``itertools.groupby`` function. It expects to process items in key-sorted
order. In addition to the ``.process()`` method required of all nodes, you must
also define a ``.key()`` method that extracts a key from each item being
processed. See the Github project page for an example of using ``GroupByNode``.
.. autoclass:: consecution.nodes.GroupByNode
    :members:
Manually Connecting Nodes
-------------------------
The Node base class is equipped with an ``.add_downstream(other_node)`` method.
This method provides detailed control over how nodes are wired together. It
simply adds ``other_node`` as a downstream relation.
Here is an example of creating a pipeline with one top node that broadcasts
items to two downstream nodes, and then collects their results into a single
output node.
.. code-block:: python

    from __future__ import print_function
    from consecution import Pipeline, Node

    class SimpleNode(Node):
        def process(self, item):
            print('{} processing {}'.format(self.name, item))
            self.push(item)

    top = SimpleNode('top')
    left = SimpleNode('left')
    right = SimpleNode('right')
    output = SimpleNode('output')

    top.add_downstream(left)
    top.add_downstream(right)
    left.add_downstream(output)
    right.add_downstream(output)

    pipe = Pipeline(top)
    pipe.consume(range(2))
Node Connection Mini-language
-----------------------------
Consecution provides a concise domain-specific language (DSL) for creating
directed acyclic graphs. This is the preferred method for connecting nodes into
a pipeline. However, you may occasionally find that your desired topology is not
easy to express in the DSL. For these situations, consecution provides a
lower-level escape hatch that allows you to manually connect two nodes
together. These two levels of abstraction provide a very powerful interface for
constructing complex pipelines.
The DSL is inspired by the unix syntax for chaining together the inputs and
outputs of different programs at the bash prompt. You use the pipe symbol ``|``
to connect nodes together. The pipe operator always returns one of the nodes
in your connected topology. Below is an example of creating a
simple linear pipeline.
.. code-block:: python

    from __future__ import print_function
    from consecution import Pipeline, Node

    class SimpleNode(Node):
        def process(self, item):
            print('{} processing {}'.format(self.name, item))
            self.push(item)

    left = SimpleNode('left')
    middle = SimpleNode('middle')
    right = SimpleNode('right')

    # Wire nodes together with the bash-like pipe operator
    node_object = left | middle | right

    # You can now pass the node object into a pipeline constructor
    pipe = Pipeline(node_object)
    pipe.consume(range(2))
In order to create a directed acyclic graph (DAG) you need four basic
constructs:
* Send data from one node to a single other node
* Broadcast data from one node to a set of other nodes
* Route data from one node to one of a set of other nodes
* Gather output from several nodes into one node.
The DSL provides mechanisms for each of these constructs, and we will look at
each in turn.
Send data from single node to single node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use simple bash-like pipe syntax to send data from a single node to another
node.
.. code-block:: python

    # Send data from one node to a single other node using bash-like piping
    node1 | node2
Broadcast data from single node to multiple nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Broadcasting is accomplished by piping to a list of nodes. In the following
example, ``node1`` will send each item it pushes to ``node2``, ``node3``, and
``node4``.
.. code-block:: python

    # Broadcast to a set of nodes by piping to a list
    node1 | [node2, node3, node4]
Routing from one node to one of multiple nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Routing is accomplished by piping to a list that contains a single callable and
any number of nodes. The following example will send even numbers to
``even_node`` and odd numbers to ``odd_node``.
.. code-block:: python

    # Define a node class
    class N(Node):
        def process(self, item):
            self.push(item)

    # Define a routing function. It takes a single argument, the item
    # being pushed, and should return a string with the name of the node
    # to which that item should be routed.
    def route_func(item):
        if item % 2 == 0:
            return 'even_node'
        else:
            return 'odd_node'

    # Pipe to a list of nodes and a callable to achieve routing
    N('top_node') | [N('even_node'), N('odd_node'), route_func]
Gather output from multiple nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gathering output from a set of nodes is as simple as piping a list of nodes (and
possibly a route function) to a single node. In this example, the outputs of
``node2``, ``node3``, and ``node4`` will all be sent to ``node5``.
.. code-block:: python

    # Gather the outputs of several nodes by piping a list to a single node
    node1 | [node2, node3, node4] | node5
Pipeline
-----------------
Once nodes are wired together, they need to be encapsulated into a pipeline
before they can operate on data. This is done by passing any node in the
network as the argument to the ``Pipeline`` constructor. On construction, the
pipeline will ensure you have a valid processing graph and will execute
initialization code to ensure that the nodes are efficiently connected.
Immediately after construction, the pipeline is ready to consume data.
Consuming Iterables
~~~~~~~~~~~~~~~~~~~
When the ``.consume(iterable)`` method is called, a sequence of events occurs in
exactly this order.
#. The ``.begin()`` method on the pipeline object is called. You can override
   this method to perform any task you'd like.
#. The ``.begin()`` methods of all nodes in the network are called in
   top-down order. This means that the ``.begin()`` method of a node is
   guaranteed not to be called until the ``.begin()`` methods of all its
   ancestors have been called.
#. Items are read from the iterable argument supplied to the ``.consume()``
   method and fed through the topology of the processing graph one by one.
   Each item is completely processed by the graph before the next one is
   lifted off the iterable.
#. The ``.end()`` methods of all nodes are called in top-down order.
#. The ``.end()`` method of the pipeline is called.
Manually Feeding a Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~
In addition to consuming iterables, you can manually feed pipelines using the
``.push()`` method on the pipeline itself. When you are finished pushing items,
you can manually call the ``.end()`` method. Here is an example.
.. code-block:: python

    from __future__ import print_function
    from consecution import Node, Pipeline

    class N(Node):
        def process(self, item):
            print(item)
            self.push(item)

    pipe = Pipeline(N('first') | N('second'))

    for nn in range(2):
        pipe.push(nn)
    pipe.end()
Pipeline API Documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~
Pipelines support dictionary-like access to their nodes. Here are examples.
.. code-block:: python

    from consecution import Node, Pipeline

    # Define a node
    class N(Node):
        def process(self, item):
            self.push(item)

    # Create a pipeline with two nodes
    pipe = Pipeline(N('first') | N('second'))

    # Get a reference to a node with dictionary syntax
    first = pipe['first']

    # Replace a node with dictionary-like syntax
    pipe['first'] = N('first')
.. autoclass:: consecution.pipeline.Pipeline
    :members:
GlobalState
-----------------
The ``GlobalState`` class is a simple Python class that supports both
dictionary-like and attribute-style access. An object of this class will
be used as the default ``global_state`` attribute of a pipeline if you don't
explicitly provide one in the constructor.
.. autoclass:: consecution.pipeline.GlobalState
    :members:
================================================
FILE: docs/toc.rst
================================================
Table of Contents
=================
.. toctree::
   :maxdepth: 2

   index
   ref/consecution
================================================
FILE: pandashells.md
================================================
Pandashells One-liner Example
===
<a href="https://github.com/robdmc/pandashells">Pandashells</a> lets you use <a
href="http://pandas.pydata.org/">Pandas</a> from the bash command line. It
allows you to combine unix command-line tools (awk, grep, sed, etc.) with the
power of Pandas Dataframes and Matplotlib visualization.
Here is a one-liner that performs the exact same aggregation demonstrated by the
example consecution pipeline.
```bash
cat sample_data.csv | \
p.df 'df["group"] = ["adult" if a>=18 else "child" for a in df.age]' | \
p.df 'df.pivot_table(index="group", columns="gender", values="spent", margins=True, aggfunc=sum).fillna(0)' \
-o table index
```
================================================
FILE: publish.py
================================================
import subprocess
subprocess.call('pip install wheel'.split())
subprocess.call('python setup.py clean --all'.split())
subprocess.call('python setup.py sdist'.split())
# subprocess.call('pip wheel --no-index --no-deps --wheel-dir dist dist/*.tar.gz'.split())
subprocess.call('python setup.py register sdist bdist_wheel upload'.split())
================================================
FILE: sample_data.csv
================================================
gender,age,spent
male,11,39.39
female,10,34.72
female,15,40.02
male,19,26.27
male,13,21.22
female,40,23.17
female,52,33.42
male,33,39.52
female,16,28.65
male,60,26.74
================================================
FILE: setup.cfg
================================================
[nosetests]
nocapture=1
verbosity=1
with-coverage=1
cover-branches=1
#cover-min-percentage=100
cover-package=consecution
[coverage:report]
show_missing = True
fail_under = 100
exclude_lines =
    # Have to re-enable the standard pragma
    pragma: no cover
    # Don't complain if tests don't hit defensive assertion code:
    raise NotImplementedError

[coverage:run]
omit =
    consecution/version.py
    consecution/__init__.py
[flake8]
max-line-length = 120
exclude = docs,env,*.egg
max-complexity = 10
ignore = E402
[build_sphinx]
source-dir = docs/
build-dir = docs/_build
all_files = 1
[upload_sphinx]
upload-dir = docs/_build/html
[bdist_wheel]
universal = 1
================================================
FILE: setup.py
================================================
#!/usr/bin/env python

import io
import os
import re
from setuptools import setup, find_packages

file_dir = os.path.dirname(__file__)


def read(path, encoding='utf-8'):
    path = os.path.join(os.path.dirname(__file__), path)
    with io.open(path, encoding=encoding) as fp:
        return fp.read()


def version(path):
    """Obtain the package version from a python file e.g. pkg/__init__.py

    See <https://packaging.python.org/en/latest/single_source_version.html>.
    """
    version_file = read(path)
    version_match = re.search(r"""^__version__ = ['"]([^'"]*)['"]""",
                              version_file, re.M)
    if version_match:
        return version_match.group(1)
    raise RuntimeError("Unable to find version string.")


LONG_DESCRIPTION = """
Consecution is an easy-to-use pipeline abstraction inspired by
Apache Storm topologies.
"""

setup(
    name='consecution',
    version=version(os.path.join(file_dir, 'consecution', '__init__.py')),
    author='Rob deCarvalho',
    author_email='unlisted',
    description=('Pipeline Abstraction Library'),
    license='BSD',
    keywords=('pipeline apache storm DAG graph topology ETL'),
    url='https://github.com/robdmc/consecution',
    packages=find_packages(),
    long_description=LONG_DESCRIPTION,
    classifiers=[
        'Environment :: Console',
        'Intended Audience :: Developers',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3.5',
        'Topic :: Scientific/Engineering',
    ],
    extras_require={'dev': ['nose', 'coverage', 'mock', 'flake8', 'coveralls']},
    install_requires=['graphviz']
)