The other day, after having secured a comfy corner chair in my favorite cafe, I had an exciting idea: What if you applied the agile programming methodology to the burgeoning field of data science?

What had occurred to me was that none of the data science projects I had been involved with over the past years had started out with a solidly engineered, normalized, and referentially constrained database schema to store the data and an algorithm of equally thorough design to perform analyses on them. Rather, there had always been a simple, functional prototype at the outset, say, a spreadsheet with macros from a domain expert or a CSV file and a script from an IT expert. As the project evolved, over many iterations, and often with a growing team in which domain and IT experts joined forces, the data set and the analysis algorithm had grown more sophisticated, and so had the tools and technologies that were applied in the process.

To conceptualize this workflow in terms of the agile software development framework, with its focus on working code and iterative change in cross-functional teams, seemed like a very appealing idea to me; in fact, I realized that, having worked in agile developer teams for years, I had intuitively managed my data science projects the “agile way”.

It did not take long to find out that I was not the first to have this idea. On the contrary, “agile data” seems to be actually quite hip these days, with a dedicated web site and a whole book devoted to agile data science. Fascinated, I embarked on a survey of the available web resources on this topic; I was particularly curious to learn which tools the other practitioners of agile data science are using.

Not surprisingly, most of the agile data science folks seem to follow the established best practices of the general data science field and subscribe to the Hadoop ecosystem. However, I was pleased to see that a number of agile data scientists have, like myself, put KNIME, an Open Source data workflow analytics platform, at the center of their data analysis environment (see this article for a good introduction).

In my experience, KNIME is uniquely positioned for large data science projects because it allows all members of the team, IT and domain experts alike, to work together within the same environment. The IT experts can provide access to structured data using the database connector or, preferably, a web service[1], and the domain experts can choose from a huge collection of data manipulation, analytics and visualization nodes to generate the results they are looking for. There is also the option to delegate more complex or computationally expensive processing steps to R or Python, the two languages most widely used by data scientists. Finally, KNIME also offers a number of ways of scaling out to cope with large data volumes, including a (commercial) connector to Hadoop/HDFS file systems through Hive.

It may seem almost like a contradiction of the spirit of the agile method to argue that all members of an agile data science team should use one particular tool – after all, the Agile Manifesto tells us to value “Individuals and interactions over processes and tools”. However, KNIME is not so much a tool as an integration platform that gives you enormous freedom with regard to how you get the job done while making it transparent to everybody in the team what happens to the data at every step in the workflow. It is this combination of flexibility and transparency that makes KNIME such a great platform for agile data science projects.

To conclude, it was very revealing to me to re-frame my thinking about data science projects in terms of the agile programming paradigm and I am eager to apply these insights to my next KNIME project.

Footnotes
  1. Also see this blog post on using KNIME and REST for Life Sciences Discovery Informatics applications.

Many Life Sciences Discovery Informatics applications have to deal with some unpleasant combination of high data volume, high data velocity, and high data variety – the classic “3Vs of Big Data”. While applications that combine high values for all three Vs are rare in the Life Sciences – High-Content Screening (HCS) and Next-Generation Sequencing (NGS) come to mind – you can always rely on your input data to be variable, either in terms of the input formatting, or in terms of the input data structures, or both. Moreover, in the vast majority of cases the data volume is too large to be handled properly with a collection of Excel files, so a robust IT infrastructure for storing and validating the incoming data is required. In short, the average Life Sciences Discovery Informatics application needs to be very nimble and very robust at the same time.

In this post, I want to outline an application architecture that fits this bill exceptionally well – namely, the combination of KNIME with a RESTful server.

KNIME and REST

KNIME is a powerful and extensible platform for data analytics based on the concept of data analysis workflows where data flows (mostly) in tables from one data processing node to the next. With a vibrant and rapidly growing community built around its Open Source development model, the KNIME platform now offers more than 1000 different processing nodes from a wide variety of data analytics disciplines such as text processing, network analysis, and cheminformatics.

The typical KNIME workflow follows an “Extract, Transform, and View” approach, i.e., data is extracted from various sources, processed through some fancy analysis algorithm, and then visualized, often in an interactive and iterative fashion. Less common, but equally easy to do with KNIME, are workflows that follow the widely known “Extract, Transform, and Load” (ETL) approach where the result data from the analysis are pushed back into storage for subsequent reporting, possibly in an entirely automated cycle.

With a classical relational database backend, the “Load” part of the ETL cycle in KNIME is typically implemented using the built-in database connection nodes to perform appropriate database inserts. However, accessing the database layer directly is notoriously brittle (think schema changes) and is also not looked kindly upon in corporate environments (think end users editing INSERT statements). A more elegant, robust and safe approach is to wrap the load operation in a web service and submit the data from KNIME through a web service call – and this is where REST enters the picture.

In recent years, REST has become ubiquitous as the architecture of choice for web applications. Key to this phenomenal success are the concept of URL-addressable resources, the statelessness and uniform interface of all client-server interactions, and hypermedia. Portal sites like Mashape and companies like Apigee are among the most visible examples of this new paradigm of web application development.

Example application

I would like to illustrate what the KNIME & REST dream team can do with a simple application that allows KNIME users to execute arbitrary command line tools remotely in a convenient and secure fashion. This is only meant to show the basics of what a KNIME- and REST-based application architecture can do, and it deliberately skips many of the implementation and installation details; please refer to the links in the footnotes for further information.

The REST service for the remote command execution application is called “telex” (short for “tele-execution” [1]). It exposes only two top-level resources, a ShellCommandDefinition resource and a ShellCommand resource. A new shell command definition is created with a POST request to the ShellCommandDefinition resource. If the telex server runs at http://telex in your local network, the POST request would go to http://telex/shell-command-definitions with the following JSON request body [2]:
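Purely as an illustration, the request might look roughly like the following Python/requests sketch; only the URL, the command name echo, and the __jsonclass__ hint (see footnote 2) are taken from the description, while all other attribute names and values are invented:

```python
import requests

# Hypothetical payload; except for __jsonclass__ and the command name "echo",
# the attribute names and the exact class hint format are guesses.
payload = {
    '__jsonclass__': 'telex.ShellCommandDefinition',
    'name': 'echo',
    'executable': '/bin/echo',
}
response = requests.post('http://telex/shell-command-definitions', json=payload)
print(response.status_code)   # expect 201 (Created)
```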

The server responds with an HTTP 201 Created message and sends a representation of the newly created command definition in the body of the response.

The echo command takes a single parameter, the text to be echoed. To create a parameter definition for this parameter, we next perform a POST request to the nested ParameterDefinition resource at http://telex/shell-command-definitions/echo/parameter-definitions with this JSON request body:
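Continuing the illustrative sketch from above (only the URL and the single text parameter come from the description; the attribute names are invented):

```python
import requests

param_payload = {
    '__jsonclass__': 'telex.ParameterDefinition',   # class hint, format guessed
    'name': 'text',
    'value_type': 'string',
}
requests.post('http://telex/shell-command-definitions/echo/parameter-definitions',
              json=param_payload)
```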

Again, the server acknowledges the creation of the parameter definition resource with an HTTP 201 Created response.

With the echo command now operational, to run it we perform one more POST, this time to the Commands resource at http://telex/shell-commands with this JSON request body:
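In the same illustrative vein (the payload attribute names are invented; the response attributes are the ones described below):

```python
import requests

cmd_payload = {
    '__jsonclass__': 'telex.ShellCommand',
    'command': 'echo',
    'parameters': {'text': 'Hello, telex!'},
}
response = requests.post('http://telex/shell-commands', json=cmd_payload)
result = response.json()
# The ShellCommand representation records the outcome of the run.
print(result['exit_code'], result['output_string'], result['error_string'])
```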

Once the command has finished, the server returns the new ShellCommand resource containing the exit code of the program in the exit_code attribute, the output captured from stdout in the output_string attribute, and the output captured from stderr in the error_string attribute [3]. Note that the ShellCommand resource gives you a complete record of who issued which command at what time, including parameters and output, which can come in very handy the next time you are trying to run the same command (and it can be queried at any time with a simple GET request).

To perform these REST operations in KNIME, we use the KREST extension from the trusted KNIME community site [4]. A simple workflow for the interactions with the telex server described above could look like this:

[Image: KNIME workflow implementing the telex interactions described above]

The user specifies the base URL of the telex service with the String Input node at the top and then composes the tables needed to generate the JSON representations for the desired command and parameter definitions using Table Creator nodes.

Manually entering data using the Table Creator node is not very user friendly, as the column names in the table must exactly match the attribute names in the resulting JSON representation [5]. To simplify this task, I wrote the Assisted Table Creator (ATC) node [6], which uses an RGG template to assist the user with a dialog for entering the parameter data. An RGG template is a simple text file; for example, the template for submitting an echo command as shown above looks like this:

During node configuration, this template is then translated to the following – very simple – data entry dialog:

[Screenshot: the auto-generated data entry dialog]

Once the dialog is closed, the output table is generated, which in turn can be converted to JSON and submitted to the telex server as in the example above.

But wait, there is more: With the RGG plugin for the telex server, you don’t even have to write these templates – they will be generated automatically for all telex commands. Technically, the RGG plugin just adds a new renderer which knows how to convert ShellCommandDefinition member resources into RGG templates. Once the plugin is installed, all it takes to make the telex commands available as auto-generated RGG templates in KNIME is to add the URL http://telex/shell-command-definitions/@@rgg in the preferences dialog of the Assisted Table Creator node [7].

The shell command execution API that the telex server provides is a very simple example demonstrating the power and versatility of combining a REST API with KNIME. Of course, the concept of passing parameters collected with an RGG template in KNIME as a JSON payload can easily be transferred to calls into your own application server’s REST API. For the adventurous, the telex server also provides RestCommandDefinition and RestCommand resources that allow you to define and execute such REST calls just like calls to shell commands. In such a setup, the telex server can act as a portal to well-defined REST services for your KNIME users, all conveniently configurable through auto-generated RGG templates.

I had better stop now to keep this short; hopefully, this overview has sparked your interest in teaming up KNIME and REST to solve complex discovery informatics problems!

Footnotes
  1. The service was implemented as an everest application and is available here.
  2. The __jsonclass__ attribute is used internally by the telex server to infer the class of the POSTed object (“class hinting”).
  3. In the interest of brevity, I will skip discussing issues of error handling and timeouts with non-terminating commands here.
  4. The KREST nodes were developed at Cenix BioScience during a research project funded by the EU and the German Federal Ministry of Education and Research (BMBF).
  5. This includes a number of columns for internal use by the telex server such as the __jsonclass__ fields and dotted column names for nested attributes.
  6. This node is based on the excellent MPI scripting nodes and is available from this Eclipse update site.
  7. Provided you have the telex server including the RGG plugin and the ATC node set up, you can play with the workflow shown above after downloading it here.

In a previous post on unit testing I described the motivation behind our recent move to pytest as a unit testing framework and our experiences in the early phase of the migration. Here, I want to highlight a more advanced issue we encountered during the arduous work of porting our domain model unit tests to the new pytest-based framework: testing complex, hierarchical data structures.

The @pytest.mark.parametrize decorator is very useful for setting up simple test objects on the fly, but since the decorator is called at module load time, you may not be able to create complex test objects with it. Also, the syntax is fairly cumbersome to read with more complex object initializations.

To illustrate the problem, consider the following simple test class diagram [1]:

[Class diagram: MyEntity with its associated MyEntityParent, MyEntityChild, and MyEntityGrandchild classes]

Suppose we want to perform tests on MyEntity instances. Suppose further that we need also properly initialized MyEntityParent, MyEntityChild and MyEntityGrandchild instances attached to our test object. The standard way to achieve this would be a pytest fixture looking like this:
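A minimal sketch of such a fixture (the import path and the constructor arguments id, parent and children are assumptions for illustration, not the actual everest signatures):

```python
import pytest
# Import path and constructor signatures are assumed; the classes come from
# the everest test suite (see footnote 1).
from everest.tests.complete_app.entities import (MyEntity, MyEntityChild,
                                                 MyEntityGrandchild,
                                                 MyEntityParent)

@pytest.fixture
def my_entity():
    # Build the complete object tree by hand.
    parent = MyEntityParent(id=0)
    grandchild = MyEntityGrandchild(id=0)
    child = MyEntityChild(id=0, children=[grandchild])
    return MyEntity(id=0, parent=parent, children=[child])
```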

However, if we now needed a differently configured MyEntity instance – say, with two MyEntityChild instances or a different id, we would end up with a lot of code duplication.

After some time playing around with various ideas on how to make the generation of complex test object trees more flexible, I had the idea to simply add one layer of indirection by defining test object factory fixtures for the individual test object classes like this:
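A simplified sketch of this idea, continuing the assumptions from above (the real implementation differs in its details, but the caching and default-argument behavior described below is the essential part):

```python
import pytest

class ObjectFactory(object):
    # Caching test object factory: only one instance is created per unique
    # combination of positional and keyword arguments.
    def __init__(self, entity_cls, **default_kw):
        self._entity_cls = entity_cls
        self._default_kw = default_kw
        self._instances = {}

    def __call__(self, *args, **kw):
        key = repr((args, sorted(kw.items())))
        if key not in self._instances:
            self._instances[key] = self.new(*args, **kw)
        return self._instances[key]

    def new(self, *args, **kw):
        # Always create a fresh instance, applying the factory defaults.
        options = dict(self._default_kw, **kw)
        return self._entity_cls(*args, **options)

@pytest.fixture
def my_entity_parent_fac():
    return ObjectFactory(MyEntityParent, id=0)

@pytest.fixture
def my_entity_grandchild_fac():
    return ObjectFactory(MyEntityGrandchild, id=0)

@pytest.fixture
def my_entity_child_fac(my_entity_grandchild_fac):
    return ObjectFactory(MyEntityChild, id=0,
                         children=[my_entity_grandchild_fac()])

@pytest.fixture
def my_entity_fac(my_entity_parent_fac, my_entity_child_fac):
    return ObjectFactory(MyEntity, id=0,
                         parent=my_entity_parent_fac(),
                         children=[my_entity_child_fac()])
```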

When used as a fixture parameter in a test function (or another fixture), calling these factories with no arguments always returns an instance fully initialized with the default arguments defined by the factory fixture:
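For example (the entity attribute names used in the assertions are again assumptions):

```python
def test_entity_with_defaults(my_entity_fac):
    entity = my_entity_fac()       # no arguments: fully initialized defaults
    assert entity.parent is not None
    assert len(entity.children) == 1
```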

Very convenient. However, it is also straightforward to create customized test object fixtures using these factories:
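For instance, a fixture providing a MyEntity with two differently configured children might be sketched like this:

```python
@pytest.fixture
def my_entity_with_two_children(my_entity_fac, my_entity_child_fac):
    # Two differently configured children; their children argument is left at
    # the factory default (see the caveat below).
    children = [my_entity_child_fac(id=0), my_entity_child_fac(id=1)]
    return my_entity_fac(id=1, children=children)
```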

There is one thing to keep in mind here: The default behavior of the test object factories is to create only one instance for each unique combination of positional and keyword arguments. So, in the example above, since the children attribute was not customized in the calls to the my_entity_child_fac factory, both MyEntityChild instances created end up with the exact same MyEntityGrandchild instance in their children list. If you absolutely need new instances, you can either pass unique arguments to the factory or call its new method.

On another, perhaps slightly esoteric, side note I want to point out that the code above has the unnerving effect of immediately generating a pylint warning in line 10, code number W0621, informing you that a name from the outer scope in line 4 was just redefined. Getting rid of this warning without disabling it manually or moving the test fixture into a separate conftest module turned out to be not all that easy; I eventually resorted to the following way of declaring my fixtures at the top of my test modules:

combined with one of the ingenious pytest hooks to pre-process the module at test collection time (this needs to go somewhere pytest can see it, e.g. in a conftest module in the test package):

Again, I am happy to report that pytest gives me all the tools needed to cope with our sometimes complex testing scenarios and that I do not regret our decision to migrate to pytest at all.

Footnotes
  1. These classes are taken from the test suite of everest.

With the increasing complexity of the data structures and business logic of our main development project at work, we are slowly arriving at a point where the current way of testing is simply no longer tenable: The whole test suite, with close to 2000 tests, now takes about 3 hours to run. Bad habits are starting to creep in – such as committing your changes without running a full test suite on your local machine and relying on your build system to tell you what you have broken with your last commit. Yes, really bad habits.

There are a number of approaches to improve this desolate situation (cf. also this blog post):

  • Speed up data access by staging all testing data in an in-memory database prior to the test run;
  • Introduce test tiers, with the simple (and, presumably, faster) model-level tests always enabled while the more complex functional tests only run overnight;
  • Avoid repeated test object construction through careful analysis of which test objects can be reused for which tests.

Ultimately, the solution will probably require a combination of all the approaches listed above, but I decided to start with the last one and to review our current testing framework along the way.

Our group is currently using a “classical” unittest.TestCase testing infrastructure with nose as the test driver. Carefully crafted set_up and tear_down methods in our test base classes ensure that the framework for our application (a REST application based on everest) is initialized and shut down properly for each test. Data shared between tests are kept in the instance namespace or in the class namespace of the test class itself or any of its super classes.

After an in-depth review of our complex hierarchy of test classes, I realized that it would be difficult to implement the desired flexibility in reusing test objects across tests, because the unittest framework offers very limited facilities for separating the creation of test objects from the tests themselves. Looking for alternative frameworks, I quickly came across pytest, which promises to be more modular through rigorous use of dependency injection, i.e., passing each test function exactly the test objects (or “fixtures”) it needs to perform the test. I decided to give pytest a shot, and the remainder of this post reports on the experiences I had over the course of this experiment.

For easier porting of the existing tests, I started out replicating the functionality of the current testing base classes derived from unittest.TestCase with pytest fixtures. As it turned out, this made it also easy to understand the different philosophies behind these two testing environments. For example, the base class for tests requiring a Pyramid Configurator looked like this in the unittest framework:
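A rough sketch of such a base class (the real code used our own set_up/tear_down naming and an ini-parsing base class, which is only stubbed out here; the settings attribute is a placeholder for the actual class-namespace parameters):

```python
import unittest
from pyramid import testing

class TestCaseWithIni(unittest.TestCase):
    # Stand-in for the real base class, which provides ini file parsing.
    pass

class BaseTestCaseWithConfiguration(TestCaseWithIni):
    # Configurator parameters live in the class namespace so that subclasses
    # can override them.
    settings = {}

    def setUp(self):
        # Create the Pyramid Configurator and store it on the instance.
        self.config = testing.setUp(settings=self.settings)

    def tearDown(self):
        # Deconstruct the configurator's registry to avoid cross talk.
        testing.tearDown()
```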

BaseTestCaseWithConfiguration inherits from TestCaseWithIni which provides ini file parsing functionality. The Pyramid Configurator instance is created in the set_up method using parameters that are defined in the class namespace and stored in the test case instance namespace. To avoid cross talk between tests, the tear_down method deconstructs the configurator’s registry.

This test base class is used along the lines of the following contrived example:
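For example (a made-up test, continuing the sketch above):

```python
class SomeResourceTestCase(BaseTestCaseWithConfiguration):
    settings = {'some.option': 'test value'}   # overridden in the class namespace

    def test_configurator_is_available(self):
        # The Configurator created in setUp is ready to use.
        self.assertIsNotNone(self.config.registry)
```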

Now, the equivalent pytest fixture and test module look like this:
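Sketched with the same simplifications as above:

```python
import pytest
from pyramid import testing

@pytest.fixture
def ini():
    # Stand-in for the real ini-parsing fixture; returns the parsed settings.
    return {}

@pytest.fixture
def configurator(request, ini):
    config = testing.setUp(settings=ini)

    def tear_down():
        # Inline teardown: deconstruct the registry after the test has run.
        testing.tearDown()
    request.addfinalizer(tear_down)
    return config

def test_configurator_is_available(configurator):
    # pytest passes the properly initialized Configurator straight to the test.
    assert configurator.registry is not None
```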

While the mechanics of the pytest fixture have not changed very much (and are perhaps a tad harder to read because of the inline tear_down function), the test module has gotten a lot simpler: There is no need to derive from a base class, and the properly initialized Configurator test object is passed automatically to the test function by the pytest framework. Moreover, while the pytest fixtures can depend on each other (in the example, the configurator fixture depends on the ini fixture, which in turn is passed in automatically by pytest), they are much more modular than the TestCase classes, and you can pull in whichever fixtures you need in a given test function.

Since pytest comes with unittest integration, most of our old tests ran right out of the box. However, there was no support for the __test__ attribute that nose offers to manually exclude base classes from testing; also, the unittest plugin of pytest does not automatically exclude classes with names starting with an underscore from testing, as nose does. Fortunately, fixing these problems was trivial using one of the many, many pytest hooks:
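One way to do this (a sketch, not necessarily the exact hook we ended up with) is to filter the collected items with pytest_collection_modifyitems:

```python
def pytest_collection_modifyitems(config, items):
    # Emulate nose: drop tests whose class sets __test__ = False or whose
    # class name starts with an underscore.
    def keep(item):
        cls = getattr(item, 'cls', None)
        if cls is None:
            return True
        return getattr(cls, '__test__', True) and not cls.__name__.startswith('_')
    items[:] = [item for item in items if keep(item)]
```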

To make the basic test fixtures for everest applications usable from other projects, I bundled them, together with the test collection hook above, as a pytest plugin, which I published using the setuptools entry point mechanism alongside the old nose plugin entry point in the everest.setup module like this:
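In outline (the module paths below are placeholders; the pytest11 and nose.plugins.0.10 entry point groups are the standard ones used by pytest and nose, respectively):

```python
from setuptools import setup

setup(
    name='everest',
    # ... other arguments elided ...
    entry_points={
        # pytest discovers plugins through the 'pytest11' entry point group.
        'pytest11': [
            'everest = everest.tests.fixtures',
        ],
        # The pre-existing nose plugin entry point.
        'nose.plugins.0.10': [
            'everest = everest.tests.noseplugin:EverestNosePlugin',
        ],
    },
)
```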

Next, I needed to add support for the custom app-ini-file option that is used to pass configuration options, particularly the configuration of the logging system, to everest applications. This was also straightforward using the pytest_addoption and pytest_configure hooks:
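A sketch of the idea (the option name comes from above; everything else, including handing the ini file to logging.config.fileConfig, is illustrative):

```python
import logging.config

def pytest_addoption(parser):
    parser.addoption('--app-ini-file', action='store', default=None,
                     help='everest application ini file')

def pytest_configure(config):
    ini_file = config.getoption('--app-ini-file')
    if ini_file is not None:
        # Configure the logging system from the ini file's logging sections.
        logging.config.fileConfig(ini_file)
```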

If you now configure, in an ini file, a console handler for your logger that uses a logging.StreamHandler to direct output to sys.stderr, and then pass this file to the pytest driver with the app-ini-file option, the logging output from failed tests will be reported by pytest in the “Captured stderr” output section.

Finally, I needed to get code coverage working with pytest. The simplest way to achieve this seemed to be the pytest-cov plugin. However, I could not get correct coverage results for the everest sources when using this plugin, presumably because quite a few of the everest modules are loaded when the everest plugin for pytest is initialized, so I decided to run coverage separately from the command line – which is not much of an inconvenience and perhaps the right thing to do anyway.

With this setup, we can now use the pytest test driver to run the existing test suite for our REST application while we gradually replace our complex test case class hierarchy with more modular pytest test fixtures that will ultimately help us cut down our test run time.

Of course, every piece of magic has its downsides. If your IDE pampers you with a feature to transport you to the definition of an identifier at the push of a button (F3 in pydev, for instance), then you might find it disconcerting that this will not work with the test fixtures that are magically passed in by pytest. However, in most cases it should be very straightforward to find the definition of a particular fixture, since there are only a couple of places where you would sensibly put it (inside the test module if it is not shared across modules, or in a conftest.py module if it is).

In summary, I think that the benefits of using pytest far outweigh its downsides and I encourage everyone to give it a try.

Although the holiday season is sort of officially over (with Dreikoenigstag, on the 6th), I thought I would sneak in a post whose title fits the holiday season, and whose contents could be a tiny gift to some touch typists working on the command line.

Of late, I have started creating keyboard shortcuts that keep my fingers close to the home keys. This is especially useful as I use different keyboards during the day, each of which has a different layout for some crucial keys. I also don’t like reaching for things like escape, or the number keys, which are needed for Perl programming (e.g. 4-$).

My .vimrc file contains the following custom mappings. ‘:imap’ tells vi to interpret the first set of keys (e.g. ‘;x’) as another (‘escape + dd + insert’). Just add the mappings to the .vimrc file in your home directory and you can try them out. One thing you should do, though, is create a vim alias that points to a plain .vimrc file. That way you can paste in text that contains the mappings that you’ve created.

The first set helps with general editing tasks. One of the main advantages is that you can do lots of editing without having to exit edit mode. I find this to be much faster than hitting escape and then the appropriate key(s).

The second set turns vi into a poor man’s Eclipse. I’m sure there are other tricks one could use to save even more time (and there are others I haven’t shown, as they are not obvious and might only be of use to me).

You can also add the following to .inputrc and get keyboard shortcuts in any shell that uses the Readline library.

For example, the first one will create the R assignment operator with little finger movement. Apparently, the terminals on which R was developed had a key for ‘<-’, so it wasn’t as awkward to use then. I do agree, as per the Google style guide, that ‘<-’ is superior to ‘=’ (even from a purely aesthetic viewpoint). Happy typing!

Admittedly, the term “Life Sciences Discovery Informatics”, which figures so prominently in the tag line of this blog, is a little unwieldy. Here, I will make a brief attempt at explaining why we chose it anyway.

Let us start with a definition of the term “Life Sciences”. Wikipedia makes the following helpful suggestion:

“The life sciences comprise the fields of science that involve the scientific study of living organisms, such as plants, animals, and human beings, as well as related considerations like bioethics. While biology remains the centerpiece of the life sciences, technological advances in molecular biology and biotechnology have led to a burgeoning of specializations and new, often interdisciplinary, fields.”

Reading only the first sentence, one would think that the Life Sciences are actually nothing more than good old biology. Even after finishing the second sentence one cannot help but think that it was probably the refusal on the part of scientists coming from such illustrious disciplines as physics and chemistry to call themselves “biologists” that drove the invention of this now so ubiquitous term. In any event, Life Sciences are hip these days and the research results from its various subdisciplines drive the development of countless commercial applications in the biotechnology, pharmaceutical, and healthcare industries.

Conducting research in the Life Sciences, however, is notoriously difficult: Experiments tend to have many factors and need many replicates to account for the intrinsic complexity and variability of living systems. Also, experimental methods and designs are refined iteratively as insights into the system under study accumulate. With respect to building an IT infrastructure to support Life Sciences research operations, this translates to massive, complex data sets and frequently changing requirements. Naturally, standardization of data structures and processes tends to be difficult in such an environment and agility is key, both with respect to the software tools to use and the development methods to adopt.

Note that it is only the research – or “discovery” – domain within the large field of Life Sciences IT which poses this very special set of challenges. Large parts of the Healthcare industry, for instance, are tightly regulated, resulting in very different constraints on their supporting IT infrastructure.

There is a nascent field called “Discovery Informatics” which is devoted to applying computer science to advance discovery across all scientific disciplines. The field is so nascent, in fact, that Wikipedia has nothing to say about it. The best definition I could find is this one from William W. Agresti [1]:

“Discovery Informatics is the study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data.”

It is at the intersection of Life Sciences and Discovery Informatics where this blog is trying to make a contribution – and, despite its length, the term “Life Sciences Discovery Informatics” seems the best way to describe this very special field.

Footnotes
  1. William W. Agresti, Communications of the ACM, vol. 46, no. 8, pp. 25–28, 2003.

If, like me, you spend your day working at the command line (for me, mostly at the Bash and R shells in Linux), knowing what commands you’ve typed (and when and where) can be very useful.

Adding the following line (substitute whatever you like for ‘AGW’) will result in all your commands entered at the Bash command line (along with date and filesystem location) being logged to a file (in this case, ~/.bash_history.AGWcomplete).

Each directory where commands are run also gets its own history file (.bash_history.AGW). Very useful if you want to know the genesis of a particular file in a directory.

Although I did not use it, S-Plus apparently had an audit function that would record commands executed on the shell into a log file. Auditing functionality is sometimes mentioned as a reason for the popularity of SAS in the pharmaceutical industry. But logging your work is of more general interest than meeting regulatory guidelines. In R, if you’re doing some initial exploration of a data set and your session crashes, your work will be lost unless you remembered to call savehistory(). Having a log also helps if you want to check how some analysis was performed.

I have a file (/AGW_functions.R) with custom functions (e.g. shortcuts to open tab-delimited files, sort data frames, run Gene Ontology enrichment analysis, …) that gets loaded when I start up an R session. The following 3 functions permit ‘auditing’:

In my .Rprofile, the following lines load my custom R file at startup:

Now every command I execute in the R shell gets recorded in ~/audit.R.

ETA: While Googling to see if this page is already indexed, I found the following:

http://www.jefftk.com/news/2010-08-04

http://www.jefftk.com/news/2012-02-13

This fellow and I share much of the rationale for wanting to do this logging. Like him, I’m also surprised that more people are not doing it. I liked the following comment at the 2nd link. It’s a great illustration of how useful pervasive logging can be:

Thanks for this tip — I’ve been using it for about six months now and it’s definitely saved me a couple of times. (Just this morning I was trying to figure out where on earth I put an essential archive of data from a decommissioned server about a month ago, which I need to unpack onto a new server today — I couldn’t remember what I had named the archive or even whether I left it on my own computer or moved it to another server somewhere, so find wasn’t helpful. Eventually I thought to grep my ~/.full_history for “scp” and the IP of the decommissioned server. Definitely wouldn’t have found it any other way.)

Inspired by my foray into OData land, I decided to extend the everest query language to support expressions with arbitrary logical operators. During the implementation of this feature, I learned a few things about parsing query expressions in Python which I thought I could share here.

Query strings in everest consist of three main parts: The filtering part which determines the subset to select from the queried resource; the ordering part specifying the sorting order of the elements in the selected subset; and the batching part used to further partition the selected subset into manageable chunks.

So far, the filtering expressions in everest had consisted of a flat sequence of attribute:operator:value triples, separated by tilde (“~“) characters. The query engine interprets tilde characters as logical AND operations and multiple values passed in a criterion as logical OR operations.

The filtering expression grammar was written using the excellent pyparsing package from Paul McGuire (API documentation can be found here). With pyparsing, you can not only define individual expressions in a very readable manner, but also define arbitrary parsing actions to modify the parsing tree on the fly. Let’s illustrate this with a simplified version of the filter expression grammar that permits only integer numbers as criterion values:
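A minimal sketch along these lines (expression names and operator spellings are my own; the real everest grammar is more elaborate):

```python
from pyparsing import Group, Literal, Word, alphas, delimitedList, nums

# Criteria have the form <attribute>:<operator>:<value>[,<value>...] and are
# chained with "~" (interpreted as logical AND); values are integers only.
colon = Literal(':').suppress()    # suppress() keeps the colon out of the results
attribute = Word(alphas, alphas + '-').setName('attribute')
operator = Word(alphas, alphas + '-').setName('operator')
number = Word(nums).setName('number')
values = Group(delimitedList(number)).setName('values')
criterion = Group(attribute + colon + operator + colon + values).setName('criterion')
criteria = delimitedList(criterion, delim='~').setName('criteria')

# e.g. criteria.parseString('age:greater-than:34~id:equal-to:1,2,3')
```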

Unlike a regular expression, this is largely self-explanatory. Note how we can give an expression a name using the setName method and reuse that name in subsequent expressions and how the suppress method can be used to omit parts of a match from the enclosing group.

The following transcript from an interactive Python session shows this grammar in action:

Not bad, but wouldn’t it be nice if the number literals were converted to integers? This is where parsing actions come in handy; to try this out, update the definition of the number expression like this:
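For instance, with the sketch above, a conversion parse action can be attached to the number expression like this:

```python
def convert_number(tokens):
    # Parse action: return the matched number literal as a Python int.
    return int(tokens[0])

number.setParseAction(convert_number)
```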

Running our parsing example with the updated grammar will now produce this output:

As desired, the criteria values are returned as Python integers now.

As nifty as these parsing actions are, there is one thing you should know about them: Your callback code should never raise a TypeError. Let me show you why; first, we change the convert_number callback to raise a TypeError like this:
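For the sake of the experiment, something like:

```python
def convert_number(tokens):
    # Deliberately broken: a parse action should never raise TypeError.
    raise TypeError('cannot convert %r' % tokens[0])

number.setParseAction(convert_number)
```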

Now, we parse our query string again:

What happened?! The error message seems to suggest that something in your parsing expression went wrong, causing the parser not to pass anything into the parse action callback. Never would you suspect the callback itself, since it looks as though it has not even been called yet. What is really going on, though, is a piece of dangerous magic in pyparsing that allows you to define your callbacks with different numbers of arguments:

The wrapper calls our callback repeatedly with fewer and fewer arguments, expecting a TypeError if the number of arguments passed does not match func’s signature. If your callback itself raises a TypeError, the wrapper gives up and re-raises the last exception – which then produces the misleading traceback shown above. I think this is an example where magic introduced for convenience actually causes more harm than good [1].

Let us now turn to the main topic of this post, adding support for arbitrary query expressions with explicit AND and OR operators. We start by adding a few more elements to our grammar:
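In the sketch grammar, those extra elements could be keyword expressions for the two logical operators (the spellings are my choice):

```python
from pyparsing import CaselessKeyword

# suppress() keeps the keywords themselves out of the parse results; the
# grouping introduced below encodes which operator joined the criteria.
and_op = CaselessKeyword('and').suppress()
or_op = CaselessKeyword('or').suppress()
```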

The new AND and OR logical operators replace the tilde operator from the original grammar. Giving the AND operator precedence over the OR operator and calling the composition of two criteria a “junction”, we can now extend our query expression as follows:
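Continuing the sketch:

```python
from pyparsing import Group, ZeroOrMore

# AND binds more tightly than OR: a junction is one or more criteria joined by
# AND, and the full expression is one or more junctions joined by OR.
junction = Group(criterion + ZeroOrMore(and_op + criterion)).setName('junction')
junctions = Group(junction + ZeroOrMore(or_op + junction)).setName('junctions')

# e.g. junctions.parseString('age:greater-than:34 and id:equal-to:1 or id:equal-to:2')
```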

Testing this with a slightly fancier query string:

Note that the result indeed reflects the operator precedence rules. Let us briefly check if we can still match our simple tilde-separated criteria query strings:

Oops – that did not go so well. The problem is that both the junctions and the delimitedList expression match a single criterion, so the parser stops after extracting the first of the two criteria. We can solve this issue easily by putting the simple criteria expression first and changing it such that it requires at least two criteria:
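In the sketch, that amounts to something like:

```python
from pyparsing import Group, Literal, OneOrMore

# Try the simple "~" form first, but only let it match if there are at least
# two criteria; single criteria then fall through to the junctions expression.
tilde = Literal('~').suppress()
simple_criteria = Group(criterion + OneOrMore(tilde + criterion)).setName('criteria')
query = simple_criteria | junctions
```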

Checking:

Lovely. By now, we only have one problem left: How can we override the precedence rules and form the OR junction before the AND junction in the above example? To achieve this, we introduce open (“(“) and close (“)“) parentheses as grouping operators and make our junctions expression recursive as follows:
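Sketched as follows:

```python
from pyparsing import Forward, Group, Literal, ZeroOrMore

open_paren = Literal('(').suppress()
close_paren = Literal(')').suppress()

# Forward-declare junctions so that it can appear inside its own definition.
junctions = Forward()
junction_element = (open_paren + junctions + close_paren) | criterion
junction = Group(junction_element + ZeroOrMore(and_op + junction_element))
junctions <<= Group(junction + ZeroOrMore(or_op + junction))

# e.g. junctions.parseString('age:greater-than:34 and (id:equal-to:1 or id:equal-to:2)')
```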

The key element here is the forward declaration of the junctions expression which allows us to use it in the definition of a junction_element before it is fully defined.

Let’s test this by adding parentheses around the OR clause in the query string above:

Great, it works! Beware, however, of the common pitfall of introducing “left recursion” into recursive grammars: If you use a forward-declared expression at the beginning of another expression declaration, you instruct the parser to replace the former with itself until the maximum recursion depth is reached. Luckily, pyparsing provides a handy validate method which detects left-recursion in grammar expressions. With the following code fragment used in place of the junction_element expression above, the grammar will no longer load but instead fail with a RecursiveGrammarException:
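For example, a left-recursive variant of the sketch above, in which the forward-declared junctions expression appears at the very start of a junction_element alternative, is caught by validate:

```python
from pyparsing import Forward, Group, ZeroOrMore

junctions = Forward()
# Left recursion: junctions is the first alternative of junction_element.
junction_element = junctions | (open_paren + junctions + close_paren) | criterion
junction = Group(junction_element + ZeroOrMore(and_op + junction_element))
junctions <<= Group(junction + ZeroOrMore(or_op + junction))

junctions.validate()   # raises pyparsing.RecursiveGrammarException
```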

I hope this post has given you a taste of the wealth of possibilities that pyparsing offers; if you are curious, the full everest filter parsing grammar can be found here. I would recommend pyparsing to anyone trying to solve parsing problems of moderate to high complexity in Python, with the few caveats I have already mentioned.

Footnotes
  1. As Paul’s comment below suggests, he is considering removing this contentious piece of magic in the upcoming 3.x series.

Last night, I had a closer look at Microsoft’s OData protocol. The opening line on the introduction page reads: “The Open Data Protocol (OData) is a Web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today.”

This sounded very much like a solution to the problem we were facing with our last-generation LIMS (Laboratory Information Management System) a few years back: a closed, monolithic system, accessible only through a native GUI client, which was difficult to maintain and to adapt to our perpetually changing requirements. In early 2010, we decided to tackle this data silo problem by redesigning the LIMS from scratch as a RESTful web services architecture. The project was a great success, and earlier this year we released the generic components of this work as a Python package called everest.

Given this background, I was quite curious to learn what the OData folks have on offer and went on to study the data model and service specification documents. This turned out to be a highly fascinating read for two reasons:

Firstly, I realized that OData shares a lot of basic design decisions with everest: Both put uniform REST operations on the exposed data objects at the center and use ATOM as the main representation content type; OData “entities” correspond to everest “member resources”, “entity sets” to “collection resources”, “properties” to “attributes”, and “navigation properties” to “links”. Initially, I was quite flattered that a project as serious and widely known as OData would make a lot of the same design decisions as we did, but then I realized that most of these shared elements are actually “forced moves” in the sense that they are just the most sensible way to build a uniform RESTful web service framework.

The second thing that fascinated me about the OData specification were not the similarities with everest, but the differences. I will quickly point out a few of them that I found most striking, starting with OData features that are missing in everest:

  • Complex data types. The everest data model only distinguishes between resources (which may reference other, nested resources) and “terminal” data objects (which have a simple, atomic type). While this simplifies the protocol, it forces the service provider to expose all complex data types as full blown, addressable resources, which is not always desirable;
  • Query parameters for fine-tuning the response. Specifically, the client can control which attributes of a resource should be included in the representation returned by the server (using the $select parameter) and whether they should be represented inline (using the $expand parameter) or as URLs (using the $links parameter). In everest, this can only be done statically for each combination of resource and representation;
  • Operations on resources. I can only guess that this part of the OData specification was added to make the transition from SOAP based web services easier. everest purposefully abstains from a “hybrid” service architecture (i.e., mixing REST with RPC-style operations). In our daily practice, we have yet to encounter a situation where (admittedly sometimes creative) use of the REST operations was not sufficient to implement the required application logic;
  • Support for PATCH. This is a really nifty feature – the ability to do partial updates for large resources is enormously useful;
  • Math and grouping operators for filter operations. This is also a neat feature as it provides a substantial extension of the realm of possible queries at relatively little cost.

There are also a few things that everest offers and OData does not:

  • Decoupling of resource and entity level. In everest, the resource layer is constructed explicitly on top of the entity domain model. Only entity attributes that are exposed through a resource attribute are visible to the client. This allows you to a) Expose a pre-existing entity domain model through a thin resource layer as a REST application; and b) Isolate changes in your entity domain model from the resource layer. Of course you could also perform such a mapping inside an OData application, but this would have to happen outside of the framework;
  • “in-range”, “contains” and “contained” operators for filter operations. Especially the “contained” operator is very handy in cases where you want to retrieve a whole collection of resources with one request;
  • CSV as representation content type. This is vital in our application domain (Life Sciences) where data import and export is still often manual (e.g., through Excel).

In the end, I came away deeply impressed with OData: The protocol specification leaves little to be desired and has been adopted by a thriving ecosystem of producers and consumers. I still think everest has a few things to offer, however: If you are already committed to OData, you could use it to reflect on OData‘s design (like I just did in the other direction); if, on the other hand, you are a Python-affine web developer looking for a RESTful framework to open up a number of data silos in your organization, everest might be able to supply all the functionality you need with very little overhead.

Last night, I was looking for a background image that would visualize the main theme of this blog to make it more appealing. Since we chose “Bits and Bases” as the tag line for the blog, an image showing random sequences of bits interspersed with random sequences of nucleotide bases seemed perfectly adequate. Not unexpectedly, a web search for suitable images did not turn up any useful hits, so I started to look for a programmatic way to create the desired symbol strings, preferably in my favorite language, Python.

There are at least a dozen different packages that would allow you to do this comfortably in Python, but I ended up with a slightly unusual choice: NodeBox, an application that allows you to create stunning 2D visuals with very little effort. What was particularly captivating about NodeBox was not only the beauty of the samples shown in the gallery, but the ease with which it allows you to explore its features interactively: Just copy & paste some of the sample code in the script editor panel, press F5 and review your rendered artwork on the display panel (or, if your script contains an error, decipher the stack trace in the message panel).

The script shown below runs inside the NodeBox environment which already has all the relevant drawing functions in the global namespace:
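A minimal NodeBox sketch in this spirit (the actual script differed; colors, font and grid spacing are arbitrary choices) could look like this:

```python
# Runs inside NodeBox 1, where size, background, fill, font, fontsize and text
# are available as global drawing functions.
from random import choice

SYMBOLS = '01GATC'          # bits interspersed with nucleotide bases
STEP = 14                   # grid spacing in points

size(800, 600)
background(0.10, 0.10, 0.15)
font('Courier')
fontsize(10)
fill(0.3, 0.7, 0.4)

for x in range(0, 800, STEP):
    for y in range(STEP, 600 + STEP, STEP):
        text(choice(SYMBOLS), x, y)
```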

Pretty simple – but it is astonishing how different the results look with different symbol and background colors or with different font sizes.

Obviously, the script above only scratches the surface of what NodeBox can do. There is much more to explore within NodeBox (paths, transformations, images) and beyond (e.g., the fancy NodeBox 2 project which adds a graphical workflow layer on top of the NodeBox engine). Enjoy!