Motivation

Most developers have to write programs that read and process items of stuff. Be it products, feed-items, images, you name it. I’ve certainly had to. Apprentice programmers would just write it all down sequentially, but – as many fellow programmers can testify – this approach leads to maintenance problems due to a mixing of responsibilities in one place. ((Wikipedia: Separation of concerns))

Pipe and filters architectures are a natural fit to this problem. They allow you to separate out responsibilities in distinct processing units. While this may lead to an increase in perceived complexity, it actually reduces the complexity you have to handle by confining it to some other level.

One especially brilliant example of this type of architecture is UNIX’ pipe system.

Disclaimer: For this post I will show you the outline of a simple yet extensible feed (RSS or other) parsing system. The implementation of the concrete parts is left as an excercise to the reader. 🙂

A naive solution with lots of boilerplate

We want to separate out processing steps (to stay maintainable) so we need to agree on an interface to pass information from one step to the next. We just start with simple class with some hook-methods.

[code language=”python”] # WARNING, Anti-Example!
# Do not use. Better code will come up below!!1!

class ProcessingStep(object):
def __init__(self):
self.results = []

def add(self, items):
for item in items:
# This is built that way so that self.process()
# can either drop or add items to the results.
# (self.process() has to return the item again
# if it should be preserved, an empty list if
# it should be dropped)

for result in self.process(item):
self.results.append(result)

def get(self):
return self.results

def process(self, item):
raise NotImplementedError
[/code]

Creating new steps is trivially easy now. We just have to subclass ProcessingStep, override process() and we’re done.

[code language=”python”] class IdentityStep(ProcessingStep):
def process(self, item):
return [item]

# We need some management class which will control our pipeline
# and feed the items through it.
class Pipeline(object):
def __init__(self, steps):
self.steps = steps

def feed(self, items):
# Feed the item into one step, get the result, feed the
# result to the next step and so on.
for step in steps:
step.add(items)
items = step.get()

return items
[/code]

Nice! This solution looks very clean and simple. It has a reasonable small line count and no magic. It still has problems though.

Reality catches up

When implementing the feed parser we identified our basic steps: parsing the feed, processing each article, downloading images, storing the articles in our database.

Sometimes we want code to run only once in the process though, be it before the items get passed or afterwards. Examples are caching, database access, etc. So to achieve that we’ll have to change either the ProcessingStep or Pipeline class. We would add a before_loop() and probably a after_loop() hook method to every step.

Regardless, our next requirement would be: Buffer 500 entries per step to enable a grouped database flush. Grouping by some arbitrary key would also be nice, wouldn’t it? Although it’s actually quite simple to extend this solution, the base-class will just grow and grow with every feature. Testing it thorougly will also take some time. Another objection I gave myself was that these patterns look so common they ought to be solved already. It just doesn’t feel right to code at this level in Python.

This was the point where I decided to look at the itertools ((Python 2 documentation: itertools — Functions creating iterators for efficient looping)) module again. To my delight many of my desired patterns were already implemented, present in the recipe section of the docs, or could be coded up quickly without much boilerplate.

Putting the band back together

We now have seen some individual components of the whole pipeline system, but to actually use the system we need two more (very simple) functions.

The first one actually constructs our processing-pipeline. It’s implemented in a very simple way and this actually was my first implementation of it. I may, in later blog-posts, elaborate a bit about the subsequent evolution of this particular function.

[code language=”python”] def combine_pipeline(source, pipeline):
"""Combine source and pipeline and return a generator.

This is basically the same as the following function:

>>> def combine_pipeline_explicit(source, pipeline):
… gen = source
… for step in pipeline:
… gen = step(gen)
… return gen

"""
return reduce(lambda x, y: y(x), pipeline, source)
[/code]

The second one we use to kickstart the combined pipeline. Please note that, since we use generators here, the pipeline can only be run once. To reset the pipeline, just call combine_pipeline again.

[code language=”python”] def consume(iter):
"""Consume an iterator, trowing away the results.

You may use this if you want to trigger some code in a
generator but don’t want to actually know the result because
everything happens as a "side-effect" of the generator running.

There are better ones, but this one is simple to explain.
Read the itertools recipe section for a more performant
implementation of this.
"""
for _ in iter:
pass
[/code]

Then you can use it like that:

[code language=”python”] pipeline_steps = [
parse_xml,
download_images,
chunk(32), # This one combines images into
# chunks of 32 and passes these
# on to `save_database`. This is
# useful to reduce the accumulation
# of latency-costs of the db-access.
save_to_database
]

source = open("feed.xml", "r")
pipeline = combine_pipeline(source, pipeline_steps)
consume(pipeline)
# You may also iterate yourself over the pipeline, e.g.:
#
# for entry in pipline:
# print entry
#
[/code]

And that was it. Every entry coming out of pipeline will have gone through every processing step along the way. Any additional processing steps can be simply added/removed/replaced just by putting it in the pipeline_steps list. Neat.

Conclusion

I’m very happy with this system and although it changed a bit and gathered many processing steps, the core of it is exactly as presented here. The pipeline length in our system is nearing 30 distinct steps now and it is thoroughly tested (~95% coverage). The longest processing step has around 70 SLOC.

At first I was wary because I knew that generators can be detrimental to performance because of the function-call overhead inherent in every iteration, yet I found that the performance of the generators were negligable in comparison to the run-time of the steps themselves.

Aside: While writing this blog-post I found it very hard to (re-)create the initial (anti-)example in the first section. Knowing a solution to a problem seems to make you blind for other ones.

Discuss it on Hacker News: https://news.ycombinator.com/item?id=6058366

On Reddit: http://www.reddit.com/r/Python/comments/1imm6s/pipes_and_filters_architectures_with_python/

Ah, by the way: we’re hiring

Please have a look at http://www.stylight.de/Jobs/ to see if there’s any position that might appeal to you! It’s great to work for this company and Munich is a great city. So, come join us! 🙂

19 Comments

Vasudev Ram on July 20, 2013 9:49 pm

Interesting post.

Might want to check out my pipe_controller tool, which is very roughly for the same area – doing “pipes” in Python.

https://bitbucket.org/vasudevram/pipe_controller

and

http://jugad2.blogspot.com/search/label/pipe_controller

– Vasudev

- Christoph Rauch on July 22, 2013 7:32 am
  
  Thanks for your feedback. That’s similar to how I wanted to do it initially, but I found using generators is just more flexible as processing-steps can hold local state without requiring the use of classes as pipeline-steps. It’s just so much less code that way. You should definitely try it. An adapter-generator for already existing call-every-step functions would look something like this:
  
  [code language=”python”]
  def callable_step(func):
  assert callable(func)
  def pipeline_step(iterator):
  for entry in iterator:
  yield func(entry)
  return pipeline_step
  [/code]
  
  Also you might want to make a setup.py so other people can use it easily if they see fit.
  
Christian Harms on July 22, 2013 8:27 am

Some years ago I coded an internal bulk framework like yours. We got better read-ability (for-if-for-constructs) and the great test coverage (testing each step).

But a main feature was to speed up steps like your “download_image”. Instead use direct iterators I built it based on the Queue.Queue() (between each processing step). This offers the possibility to span many threads for IO-related steps.

Ah, by the way: Why are no python-offers for stuttgart?

- Christoph Rauch on July 22, 2013 9:06 am
  
  Right now our pipeline doesn’t contain any IO-related steps (it’s basically CPU bound), so I can’t really use threads right now to speed things up. On top of that, because of the nature of the problem, some of the steps need to hold state and resist being split up (neatly).
  
  I sucessfully used multiprocessing.Queue to fan-out CPU bound jobs to a worker-pool once, but I’m still searching for the right abstraction level. Also, the jobs currently run in Celery which brings in some new problems. 😐
  
  As for Python in Stuttgart: My observation is that Python seeps into Germany starting with the bigger cities. First cities with major Python job opportunities were Hamburg and Berlin. Munich is now catching up. I expect some smaller cities to follow this pattern over the next years.
  
sarbjit on July 25, 2013 4:13 pm

Nice Article.

Antonio Lima on July 26, 2013 2:14 pm

Neat solution, excellent article. I love how clean and powerful is the consume_pipeline function. I think it deserves a few lines of explanations in the article. 🙂

- Christoph Rauch on July 26, 2013 3:59 pm
  
  Thanks.
  
  I think consume_pipeline it’s pretty much self-explanatory. 🙂 I chose this one because I didn’t want to have to explain this
  
  [code language=”python”]
  def consume(iterator, n):
  "Advance the iterator n-steps ahead. If n is none, consume entirely."
  # Use functions that consume iterators at C speed.
  if n is None:
  # feed the entire iterator into a zero-length deque
  collections.deque(iterator, maxlen=0)
  else:
  # advance to the empty slice starting at position n
  next(islice(iterator, n, n), None)
  [/code]
  
  Perhaps in a later post.
  
Antonio Lima on July 26, 2013 11:21 pm

Oops, I meant to write the combine_pipeline, not consume:

[code lang=”python”]
def combine_pipeline(source, pipeline):
return reduce(lambda x, y: y(x), pipeline, source)
[/code]

- Christoph Rauch on July 29, 2013 7:28 am
  
  Yes, true. I’ll ~~update~~ updated the post with a bit of explanation. 🙂
  
Pingback: Vrac de mini-messages n°11 : Debian Project Leader, OVH, build de Firefox, rachat de Sourcefire, Python, PyPy, Libreoffice et Pychef | Le blog de Carl Chenet
Stéphane on July 31, 2013 5:41 pm

I dont get why you write:
[code language=”python”]
list(range(chunk_size))
[/code]
instead of simple :
[code language=”python”]
range(chunk_size)
[/code]
in the internal chunker function.

- Christoph Rauch on August 1, 2013 7:20 am
  
  You are right. I thought that since Python3 almost everything got converted into lazy sequences by default (.items instead of .iteritems, etc.) range would be lazy as well and thus less performant. But I haven’t measured it, so I shouldn’t further complicate the code that way.
  
  I’ll update the code and drop the list call. Good catch.
  
Jelle Smet on August 26, 2013 8:43 am

Hi Cristoph,

Very interesting approach and ideas here.

I’ve been working myself on a similar idea/framework using coroutines and queues focused towards IO pipeline servers:

https://wishbone.readthedocs.org/en/latest/

Thanks for sharing,

Jelle

Balthazar Rouberol on November 12, 2013 3:22 pm

Thanks for the very informative post!

I however needed the pipeline to return the final value. In order to do so, I modified your code a little bit.

[code language=”python”]
def consume(self, return_result=False):
"""Consume an iterator, trowing away the results.

You may use this if you want to trigger some code in a
generator but don’t want to actually know the result because
everything happens as a "side-effect" of the generator running.

There are better ones, but this one is simple to explain.
Read the itertools recipe section for a more performant
implementation of this.

Note: if return_result argument is True, the result will be returned,
instead of being thrown away

"""
for result in self.combine_pipeline():
if self.return_result:
return result
[/code]

- Christoph Rauch on November 12, 2013 3:46 pm
  
  Fair enough. Although in your case I would drop the boolean parameter (which is a code smell) altogether and always return the last result. It’s easier and doesn’t hurt at all.
  
  - Balthazar Rouberol on November 12, 2013 4:25 pm
    
    Sorry, there was a small typo in my previous comment. The “return_result“ argument must be passed to the “__init__“ method, and not to the “consume“ one.
    
Antonio Lima on August 2, 2014 12:59 pm

Hi Christoph,

I loved your solution so much that I modified it slightly and turned it into a small python library, named tubo. Check it out:
https://pypi.python.org/tubo

Antonio Lima on August 2, 2014 1:00 pm

Oops… correct link: https://pypi.python.org/pypi/tubo/

- rauchc on August 7, 2014 1:43 pm
  
  Thanks for the notice. 🙂 Glad you like it.

Pipes and filters architectures with Python generators (and other iterators)