Admittedly, the term “Life Sciences Discovery Informatics”, which figures so prominently in the tag line of this blog, is a little unwieldy. Here, I will make a brief attempt at explaining why we chose it anyway.

Let us start with a definition of the term “Life Sciences”. Wikipedia makes the following helpful suggestion:

“The life sciences comprise the fields of science that involve the scientific study of living organisms, such as plants, animals, and human beings, as well as related considerations like bioethics. While biology remains the centerpiece of the life sciences, technological advances in molecular biology and biotechnology have led to a burgeoning of specializations and new, often interdisciplinary, fields.”

Reading only the first sentence, one would think that the Life Sciences are nothing more than good old biology. Even after finishing the second sentence, one cannot help but suspect that it was the refusal of scientists from such illustrious disciplines as physics and chemistry to call themselves “biologists” that drove the invention of this now ubiquitous term. In any event, the Life Sciences are hip these days, and research results from their various subdisciplines drive the development of countless commercial applications in the biotechnology, pharmaceutical, and healthcare industries.

Conducting research in the Life Sciences, however, is notoriously difficult: experiments tend to have many factors and need many replicates to account for the intrinsic complexity and variability of living systems. Also, experimental methods and designs are refined iteratively as insights into the system under study accumulate. For an IT infrastructure supporting Life Sciences research operations, this translates into massive, complex data sets and frequently changing requirements. Naturally, standardization of data structures and processes tends to be difficult in such an environment, and agility is key, both in the software tools used and in the development methods adopted.

Note that it is only the research – or “discovery” – domain within the large field of Life Sciences IT that poses this very special set of challenges. Large parts of the healthcare industry, for instance, are tightly regulated, resulting in very different constraints on their supporting IT infrastructure.

There is a nascent field called “Discovery Informatics” which is devoted to applying computer science to advance discovery across all scientific disciplines. The field is so nascent, in fact, that Wikipedia has nothing to say about it. The best definition I could find is this one from William W. Agresti [1]:

“Discovery Informatics is the study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data.”

It is at the intersection of Life Sciences and Discovery Informatics that this blog is trying to make a contribution – and, despite its length, the term “Life Sciences Discovery Informatics” seems the best way to describe this special field.

Footnotes
  1. W. W. Agresti, “Discovery Informatics,” Communications of the ACM, vol. 46, no. 8, pp. 25-28, 2003.

If, like me, you spend your day working at the command line (for me, mostly at the Bash and R shells in Linux), knowing what commands you’ve typed (and when and where) can be very useful.

Adding the following line (substitute ‘AGW’ for whatever you like) will result in all your commands entered at the Bash command line (along with date and filesystem location) being logged in a file (in this case, ~/.bash_history.AGWcomplete).
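A minimal sketch of such a setup, using Bash’s PROMPT_COMMAND hook (the function name and the exact log format here are illustrative, not necessarily what the original post used):

    # Sketch for ~/.bashrc: log each command, with a timestamp and the working
    # directory, to a global history file and to a per-directory history file.
    log_AGW_history() {
        local cmd
        cmd=$(HISTTIMEFORMAT= history 1 | sed 's/^ *[0-9]* *//')
        [ -n "$cmd" ] || return
        echo "$(date '+%F %T')  $PWD  $cmd" >> ~/.bash_history.AGWcomplete
        echo "$(date '+%F %T')  $cmd" >> .bash_history.AGW   # per-directory log
    }
    PROMPT_COMMAND=log_AGW_history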

Each directory where commands are run also gets its own history file (.bash_history.AGW). Very useful if you want to know the genesis of a particular file in a directory.

Although I did not use it, S-Plus apparently had an audit function that would record commands executed on the shell into a log file. Auditing functionality is sometimes mentioned as a reason for the popularity of SAS in the pharmaceutical industry. But logging your work is of more general interest than meeting regulatory guidelines. In R, if you’re doing some initial exploration of a data set and your session crashes, your work will be lost, unless you remembered to savehistory(). Having a log also helps if you want to check how some analysis was performed.

I have a file (/AGW_functions.R) with custom functions (e.g. shortcuts to open tab-delimited files, sort data frames, run Gene Ontology enrichment analysis, …) that gets loaded when I start up an R session. The following 3 functions permit ‘auditing’:
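A rough sketch of what three such functions could look like, built on R’s addTaskCallback() mechanism (the names audit.start, audit.stop, and audit.view and all details are assumptions, not the original code):

    # Sketch: append every top-level command, with a timestamp, to ~/audit.R.
    audit.start <- function(file = "~/audit.R") {
        cb <- function(expr, value, ok, visible) {
            cat(sprintf("# %s\n%s\n", format(Sys.time()),
                        paste(deparse(expr), collapse = "\n")),
                file = file, append = TRUE)
            TRUE  # returning TRUE keeps the callback registered
        }
        invisible(addTaskCallback(cb, name = "audit"))
    }

    # Stop recording.
    audit.stop <- function() {
        invisible(removeTaskCallback("audit"))
    }

    # Show the last n recorded commands.
    audit.view <- function(file = "~/audit.R", n = 20) {
        tail(readLines(file), n)
    }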

In my .Rprofile, the following lines load my custom R file at startup:
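As a sketch (assuming the audit.start() helper from above; the original lines may differ):

    # Sketch of an .Rprofile fragment: load the helper file and start auditing.
    if (interactive()) {
        source("/AGW_functions.R")   # the custom-function file mentioned above
        audit.start()                # start appending commands to ~/audit.R
    }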

Now every command I execute in the R shell gets recorded in ~/audit.R.

ETA: While Googling to see if this page is already indexed, I found the following:

http://www.jefftk.com/news/2010-08-04

http://www.jefftk.com/news/2012-02-13

This fellow and I share much of the rationale for wanting to do this logging. Like him, I’m also surprised that more people are not doing it. I liked the following comment at the second link; it’s a great illustration of how useful pervasive logging can be:

Thanks for this tip — I’ve been using it for about six months now and it’s definitely saved me a couple of times. (Just this morning I was trying to figure out where on earth I put an essential archive of data from a decommissioned server about a month ago, which I need to unpack onto a new server today — I couldn’t remember what I had named the archive or even whether I left it on my own computer or moved it to another server somewhere, so find wasn’t helpful. Eventually I thought to grep my ~/.full_history for “scp” and the IP of the decommissioned server. Definitely wouldn’t have found it any other way.)

Inspired by my foray into OData land, I decided to extend the everest query language to support expressions with arbitrary logical operators. During the implementation of this feature, I learned a few things about parsing query expressions in Python which I thought I could share here.

Query strings in everest consist of three main parts: the filtering part, which determines the subset to select from the queried resource; the ordering part, which specifies the sorting order of the elements in the selected subset; and the batching part, which is used to further partition the selected subset into manageable chunks.

Until now, the filtering expressions in everest consisted of a flat sequence of attribute:operator:value triples, separated by tilde ("~") characters. The query engine interprets tilde characters as logical AND operations and multiple values passed in a criterion as logical OR operations.

The filtering expression grammar was written using the excellent pyparsing package from Paul McGuire (API documentation can be found here). With pyparsing, you can not only define individual expressions in a very readable manner, but also define arbitrary parsing actions to modify the parsing tree on the fly. Let’s illustrate this with a simplified version of the filter expression grammar that permits only integer numbers as criterion values:
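A minimal sketch of such a simplified grammar might look like this (the expression names are illustrative and the details differ from the real everest grammar):

    from pyparsing import Group, Literal, Word, alphas, alphanums, nums, delimitedList

    colon = Literal(':').suppress()   # suppress() drops the separator from the output

    identifier = Word(alphas, alphanums + '-').setName('identifier')
    number = Word(nums).setName('number')

    # A single <attribute>:<operator>:<value[,value,...]> criterion.
    criterion = Group(identifier('name') + colon +
                      identifier('operator') + colon +
                      Group(delimitedList(number))('value')).setName('criterion')

    # The flat form: one or more criteria separated by "~".
    criteria = Group(delimitedList(criterion, delim='~')).setName('criteria')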

Unlike a regular expression, this is largely self-explanatory. Note how we can give an expression a name using the setName method and reuse that name in subsequent expressions, and how the suppress method can be used to omit parts of a match from the enclosing group.

The following transcript from an interactive Python session shows this grammar in action:
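With the sketch grammar above, such a session might look roughly like this (the exact ParseResults formatting depends on the pyparsing version):

    >>> print(criteria.parseString('name:equal-to:1~age:greater-than:2,3'))
    [[['name', 'equal-to', ['1']], ['age', 'greater-than', ['2', '3']]]]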

Not bad, but wouldn’t it be nice if the number literals were converted to integers? This is where parsing actions come in handy; to try this out, update the definition of the number expression like this:
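A sketch of such a parse action (the callback name convert_number is reused below; attaching it to the number expression in place means the criterion defined earlier picks it up automatically):

    def convert_number(tokens):
        # Replace the matched number literal with a Python int.
        return int(tokens[0])

    number.setParseAction(convert_number)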

Running our parsing example with the updated grammar will now produce this output:
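With the sketch grammar, the same parse call now yields integers (again, formatting is approximate):

    >>> print(criteria.parseString('name:equal-to:1~age:greater-than:2,3'))
    [[['name', 'equal-to', [1]], ['age', 'greater-than', [2, 3]]]]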

As desired, the criteria values are returned as Python integers now.

As nifty as these parsing actions are, there is one thing you should know about them: Your callback code should never raise a TypeError. Let me show you why; first, we change the convert_number callback to raise a TypeError like this:
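For example (a sketch; any TypeError raised inside the callback triggers the problem):

    def convert_number(tokens):
        raise TypeError('provoked on purpose for demonstration')

    number.setParseAction(convert_number)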

Now, we parse our query string again:

What happened?! The error message seems to suggest that something in your parsing expression went wrong, causing the parser not to pass anything into the parse action callback. Never would you suspect the callback itself, since it looks like it has not even been called yet. What is really going on, though, is a piece of dangerous magic in pyparsing that allows you to define your callbacks with different numbers of arguments:
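The following is a simplified paraphrase of what pyparsing’s internal arity-trimming wrapper does, not its actual source code:

    def trim_arity(func):
        """Wrap a parse action so it may accept (s, loc, toks), (loc, toks),
        (toks), or no arguments at all (simplified paraphrase)."""
        def wrapper(string, loc, tokens):
            args = (string, loc, tokens)
            for n in range(len(args) + 1):
                try:
                    return func(*args[n:])   # try with fewer and fewer arguments
                except TypeError:
                    if n == len(args):
                        raise                # give up, re-raise the last TypeError
        return wrapper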

The wrapper calls our callback repeatedly with fewer and fewer arguments, expecting a TypeError if the number of arguments passed does not match func’s signature. If your callback itself raises a TypeError, the wrapper gives up and re-raises the last exception – which then produces the misleading traceback shown above. I think this is an example where magic introduced for convenience actually causes more harm than good [1].

Let us now turn to the main topic of this post, adding support for arbitrary query expressions with explicit AND and OR operators. We start by adding a few more elements to our grammar:
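In the sketch grammar, the new elements might be case-insensitive keywords (the exact spelling is an assumption):

    from pyparsing import CaselessKeyword

    AND = CaselessKeyword('and')
    OR = CaselessKeyword('or')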

The new AND and OR logical operators replace the tilde operator from the original grammar. Giving the AND operator precedence over the OR operator and calling the composition of two criteria a “junction”, we can now extend our query expression as follows:
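In the sketch, the layering could look like this, with AND binding criteria into a conjunction and OR combining conjunctions (the names conjunction, junctions, and query are illustrative):

    from pyparsing import OneOrMore

    conjunction = (Group(criterion + OneOrMore(AND.suppress() + criterion))
                   | criterion)
    junctions = (Group(conjunction + OneOrMore(OR.suppress() + conjunction))
                 | conjunction)

    # The full filter expression: the new junctions or the old tilde-separated form.
    query = junctions | criteria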

Testing this with a slightly fancier query string:
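With the sketch grammar, a query mixing both operators parses like this (formatting approximate):

    >>> print(query.parseString(
    ...     'name:equal-to:1 and age:greater-than:2 or size:less-than:3'))
    [[[['name', 'equal-to', [1]], ['age', 'greater-than', [2]]], ['size', 'less-than', [3]]]]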

Note that the result indeed reflects the operator precedence rules. Let us briefly check if we can still match our simple tilde-separated criteria query strings:
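With the sketch grammar, only the first criterion of the tilde-separated string comes back:

    >>> print(query.parseString('name:equal-to:1~age:greater-than:2'))
    [['name', 'equal-to', [1]]]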

Oops – that did not go so well. The problem is that both the junctions and the delimitedList expression match a single criterion, so the parser stops after extracting the first of the two criteria. We can solve this issue easily by putting the simple criteria expression first and changing it such that it requires at least two criteria:
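In the sketch, that amounts to requiring at least two tilde-separated criteria and trying this alternative first:

    tilde = Literal('~').suppress()

    # At least two criteria joined by "~", tried before the junction form.
    criteria = Group(criterion + OneOrMore(tilde + criterion)).setName('criteria')
    query = criteria | junctions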

Checking:
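Re-running the tilde-separated example with the sketch grammar now returns both criteria:

    >>> print(query.parseString('name:equal-to:1~age:greater-than:2'))
    [[['name', 'equal-to', [1]], ['age', 'greater-than', [2]]]]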

Lovely. Now only one problem is left: How can we override the precedence rules and form the OR junction before the AND junction in the above example? To achieve this, we introduce open ("(") and close (")") parentheses as grouping operators and make our junctions expression recursive as follows:
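A sketch of the recursive version, using pyparsing’s Forward class for the forward declaration (names remain illustrative):

    from pyparsing import Forward

    open_paren = Literal('(').suppress()
    close_paren = Literal(')').suppress()

    junctions = Forward()   # forward declaration: usable now, defined below

    # A junction element is a plain criterion or a parenthesized junctions expression.
    junction_element = criterion | (open_paren + junctions + close_paren)

    conjunction = (Group(junction_element + OneOrMore(AND.suppress() + junction_element))
                   | junction_element)
    junctions << (Group(conjunction + OneOrMore(OR.suppress() + conjunction))
                  | conjunction)

    query = criteria | junctions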

The key element here is the forward declaration of the junctions expression, which allows us to use it in the definition of junction_element before it is fully defined.

Let’s test this by adding parentheses around the OR clause in the query string above:
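With the sketch grammar, parenthesizing the OR clause regroups the result accordingly:

    >>> print(query.parseString(
    ...     'name:equal-to:1 and (age:greater-than:2 or size:less-than:3)'))
    [[['name', 'equal-to', [1]], [['age', 'greater-than', [2]], ['size', 'less-than', [3]]]]]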

Great, it works! Beware, however, of the common pitfall of introducing “left recursion” into recursive grammars: If you use a forward-declared expression at the beginning of another expression declaration, you instruct the parser to keep replacing the former with itself until the maximum recursion depth is reached. Luckily, pyparsing provides a handy validate method which detects left recursion in grammar expressions. With the following code fragment used in place of the junction_element expression above, the grammar will no longer load but instead fail with a RecursiveGrammarException:
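For illustration, a left-recursive variant might allow a junctions expression as an element without requiring parentheses; rebuilding the grammar on top of it and calling validate() then raises the exception (again a sketch, not the original fragment):

    junctions = Forward()

    # Left recursion: the forward-declared junctions expression appears at the
    # very beginning of the junction_element alternatives.
    junction_element = (junctions
                        | (open_paren + junctions + close_paren)
                        | criterion)

    conjunction = (Group(junction_element + OneOrMore(AND.suppress() + junction_element))
                   | junction_element)
    junctions << (Group(conjunction + OneOrMore(OR.suppress() + conjunction))
                  | conjunction)

    junctions.validate()   # raises pyparsing.RecursiveGrammarException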

I hope this post has given you a taste of the wealth of possibilities that pyparsing offers; if you are curious, the full everest filter parsing grammar can be found here. I would recommend pyparsing to anyone trying to solve parsing problems of moderate to high complexity in Python, with the few caveats I have already mentioned.

Footnotes
  1. As Paul’s comment below suggests, he is considering removing this contentious piece of magic in the upcoming 3.x series.