Inspired by my foray into OData land, I decided to extend the everest query language to support expressions with arbitrary logical operators. During the implementation of this feature, I learned a few things about parsing query expressions in Python which I thought I could share here.

Query strings in everest consist of three main parts: The filtering part which determines the subset to select from the queried resource; the ordering part specifying the sorting order of the elements in the selected subset; and the batching part used to further partition the selected subset into manageable chunks.

Until now, filtering expressions in everest consisted of a flat sequence of attribute:operator:value triples, separated by tilde (“~”) characters. The query engine interprets the tilde characters as logical AND operations and multiple values passed in a single criterion as logical OR operations.

The filtering expression grammar was written using the excellent pyparsing package from Paul McGuire (API documentation can be found here). With pyparsing, you can not only define individual expressions in a very readable manner, but also define arbitrary parsing actions to modify the parsing tree on the fly. Let’s illustrate this with a simplified version of the filter expression grammar that permits only integer numbers as criterion values:
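A minimal sketch of what such a simplified grammar could look like. The expression names and the three operator names are illustrative assumptions rather than a copy of the everest source, and the camelCase method names follow the classic pyparsing API.

```python
from pyparsing import (Group, Literal, Word, alphanums, alphas,
                       delimitedList, nums, oneOf)

# Separator characters used in filter expressions.
colon = Literal(':')
comma = Literal(',')
tilde = Literal('~')

# Building blocks of a criterion. The operator names are illustrative
# and not the full set supported by everest.
identifier = Word(alphas, alphanums + '-').setName('identifier')
operator = oneOf('equal-to less-than greater-than').setName('operator')
number = Word(nums).setName('number')

# One or more comma-separated numbers form the value list of a criterion.
values = Group(delimitedList(number, delim=comma)).setName('values')

# attribute:operator:value(s); the suppressed colons are dropped from
# the enclosing group.
criterion = Group(identifier + colon.suppress() + operator
                  + colon.suppress() + values).setName('criterion')

# Several criteria are joined by tildes.
criteria = delimitedList(criterion, delim=tilde).setName('criteria')
```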

Unlike a regular expression, this is largely self-explanatory. Note how we can give an expression a name using the setName method and reuse that name in subsequent expressions and how the suppress method can be used to omit parts of a match from the enclosing group.

The following transcript from an interactive Python session shows this grammar in action:
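A minimal sketch of such a session, written as a script so that it can be run directly. The grammar is restated compactly, and the attribute and operator names in the query string are made up for illustration.

```python
from pyparsing import (Group, Literal, Word, alphanums, alphas,
                       delimitedList, nums, oneOf)

# Compact restatement of the simplified grammar (names are illustrative).
colon = Literal(':').suppress()
identifier = Word(alphas, alphanums + '-').setName('identifier')
operator = oneOf('equal-to less-than greater-than').setName('operator')
number = Word(nums).setName('number')
values = Group(delimitedList(number)).setName('values')
criterion = Group(identifier + colon + operator + colon + values)
criteria = delimitedList(criterion, delim='~').setName('criteria')

query = 'age:less-than:40~height:greater-than:170,180'
result = criteria.parseString(query, parseAll=True)
print(result.asList())
# [['age', 'less-than', ['40']], ['height', 'greater-than', ['170', '180']]]
```

Each criterion comes back as a nested list, with the value list as its last element; the values are still strings at this point.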

Not bad, but wouldn’t it be nice if the number literals were converted to integers? This is where parsing actions come in handy; to try this out, update the definition of the number expression like this:
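A sketch of how the number expression might be extended with a parse action. The callback name convert_number is taken from the text; its three-argument signature (input string, match location, tokens) and the rest of the grammar are my assumptions.

```python
from pyparsing import (Group, Literal, Word, alphanums, alphas,
                       delimitedList, nums, oneOf)


def convert_number(string, location, tokens):
    """Parse action: turn the matched digit string into an int."""
    return int(tokens[0])


# Same illustrative grammar as before, with the parse action attached.
colon = Literal(':').suppress()
identifier = Word(alphas, alphanums + '-').setName('identifier')
operator = oneOf('equal-to less-than greater-than').setName('operator')
number = Word(nums).setName('number').setParseAction(convert_number)
values = Group(delimitedList(number)).setName('values')
criterion = Group(identifier + colon + operator + colon + values)
criteria = delimitedList(criterion, delim='~').setName('criteria')

query = 'age:less-than:40~height:greater-than:170,180'
result = criteria.parseString(query, parseAll=True).asList()
print(result)
# [['age', 'less-than', [40]], ['height', 'greater-than', [170, 180]]]
```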

Running our parsing example with the updated grammar now returns the criteria values as Python integers, as desired.

As nifty as these parsing actions are, there is one thing you should know about them: Your callback code should never raise a TypeError. Let me show you why; first, we change the convert_number callback to raise a TypeError like this:

Now, we parse our query string again:
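A minimal reproduction, reduced to the number expression alone. The error text inside the callback is made up. Which message finally surfaces depends on the pyparsing version installed, so the script simply prints whatever TypeError arrives instead of assuming a particular wording.

```python
from pyparsing import Word, nums


def convert_number(string, location, tokens):
    # Stand-in for a genuine bug inside the callback body.
    raise TypeError('deliberate error inside the callback')


number = Word(nums).setParseAction(convert_number)

try:
    number.parseString('40')
except TypeError as exc:
    # With the pyparsing release discussed in this post, the message
    # complains about the number of arguments; other releases may
    # report the callback's own message instead.
    error = exc
    print('TypeError: %s' % exc)
```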

What happened?! The error message seems to suggest that something in the parsing expression went wrong, causing the parser not to pass anything into the parse action callback. You would never suspect the callback itself, since it looks as if it has not even been called yet. What is really going on, though, is a piece of dangerous magic in pyparsing that allows you to define your callbacks with different numbers of arguments:
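The following is a simplified reconstruction of that idea in plain Python, not pyparsing's actual code; the names trim_arity_sketch, tokens_only and broken are hypothetical. It retries the callback with fewer and fewer leading arguments and treats every TypeError as an arity mismatch.

```python
def trim_arity_sketch(func, max_trim=3):
    """Call func with as many trailing arguments as it seems to accept."""
    def wrapper(*args):
        last_error = None
        # Try (string, location, tokens) first, then drop leading
        # arguments one at a time, down to no arguments at all.
        for skip in range(max_trim + 1):
            try:
                return func(*args[skip:])
            except TypeError as exc:
                # The wrapper cannot tell an arity mismatch from a
                # TypeError raised inside the callback body.
                last_error = exc
        # Every attempt failed: re-raise the *last* error seen.
        raise last_error
    return wrapper


def tokens_only(tokens):
    return int(tokens[0])


def broken(string, location, tokens):
    raise TypeError('real bug inside the callback')


print(trim_arity_sketch(tokens_only)('40', 0, ['40']))  # 40

try:
    trim_arity_sketch(broken)('40', 0, ['40'])
except TypeError as exc:
    # Reports missing positional arguments, not the real bug.
    reported = str(exc)
    print(reported)
```

The first call shows the convenience: a one-argument callback just works. The second shows the cost: the real error is swallowed on the first attempt and replaced by the arity error from the last one.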

The wrapper calls our callback repeatedly with fewer and fewer arguments, expecting a TypeError whenever the number of arguments passed does not match func's signature. If your callback itself raises a TypeError, the wrapper gives up and re-raises the last exception, which then produces the misleading traceback shown above. I think this is an example where magic introduced for convenience is actually causing more harm than good [1].

Let us now turn to the main topic of this post, adding support for arbitrary query expressions with explicit AND and OR operators. We start by adding a few more elements to our grammar:

The new AND and OR logical operators replace the tilde operator from the original grammar. Giving the AND operator precedence over the OR operator and calling the composition of two criteria a “junction”, we can now extend our query expression as follows:

Testing this with a slightly fancier query string:
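A sketch of the extended grammar and a test query, under a few assumptions of mine: the operators are written as whitespace-separated AND and OR keywords, the intermediate names and_junction and or_junction are hypothetical, and the attribute names are made up. Precedence is encoded by building AND junctions first and then combining those into OR junctions.

```python
from pyparsing import (CaselessKeyword, Group, Literal, OneOrMore, Word,
                       alphanums, alphas, delimitedList, nums, oneOf)


def convert_number(string, location, tokens):
    return int(tokens[0])


# Illustrative criterion grammar.
colon = Literal(':').suppress()
identifier = Word(alphas, alphanums + '-')
operator = oneOf('equal-to less-than greater-than')
number = Word(nums).setParseAction(convert_number)
criterion = Group(identifier + colon + operator + colon
                  + Group(delimitedList(number)))

and_op = CaselessKeyword('AND')
or_op = CaselessKeyword('OR')

# AND binds tighter than OR: criteria are first combined into AND
# junctions, and AND junctions are then combined into OR junctions.
and_junction = Group(criterion + OneOrMore(and_op + criterion)) | criterion
or_junction = (Group(and_junction + OneOrMore(or_op + and_junction))
               | and_junction)
junctions = or_junction

query = 'age:less-than:40 OR height:greater-than:170 AND weight:equal-to:80'
tree = junctions.parseString(query, parseAll=True).asList()[0]
print(tree[1])     # OR
print(tree[2][1])  # AND
```

The top-level node is the OR junction, and its right-hand operand is the nested AND junction.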

Note that the result indeed reflects the operator precedence rules. Let us briefly check if we can still match our simple tilde-separated criteria query strings:
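A sketch of that check, assuming the query expression tries the junctions first and the tilde-separated list second. The names query_expr, and_junction and or_junction are hypothetical.

```python
from pyparsing import (CaselessKeyword, Group, Literal, OneOrMore, Word,
                       alphanums, alphas, delimitedList, nums, oneOf)


def convert_number(string, location, tokens):
    return int(tokens[0])


# Illustrative criterion grammar.
colon = Literal(':').suppress()
identifier = Word(alphas, alphanums + '-')
operator = oneOf('equal-to less-than greater-than')
number = Word(nums).setParseAction(convert_number)
criterion = Group(identifier + colon + operator + colon
                  + Group(delimitedList(number)))

and_op = CaselessKeyword('AND')
or_op = CaselessKeyword('OR')
and_junction = Group(criterion + OneOrMore(and_op + criterion)) | criterion
junctions = (Group(and_junction + OneOrMore(or_op + and_junction))
             | and_junction)

# Old-style flat list of tilde-separated criteria.
criteria = delimitedList(criterion, delim='~')

# Junctions are tried first, the flat list second.
query_expr = junctions | criteria

result = query_expr.parseString('age:less-than:40~height:greater-than:170')
print(result.asList())
# [['age', 'less-than', [40]]]
```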

Oops – that did not go so well. The problem is that both the junctions and the delimitedList expression match a single criterion, so the parser stops after extracting the first of the two criteria. We can solve this issue easily by putting the simple criteria expression first and changing it such that it requires at least two criteria:

Checking:
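A sketch of the reordered query expression together with the check. The flat list is spelled out with OneOrMore so that it needs at least two criteria; the names are again hypothetical.

```python
from pyparsing import (CaselessKeyword, Group, Literal, OneOrMore, Word,
                       alphanums, alphas, delimitedList, nums, oneOf)


def convert_number(string, location, tokens):
    return int(tokens[0])


# Illustrative criterion grammar.
colon = Literal(':').suppress()
tilde = Literal('~').suppress()
identifier = Word(alphas, alphanums + '-')
operator = oneOf('equal-to less-than greater-than')
number = Word(nums).setParseAction(convert_number)
criterion = Group(identifier + colon + operator + colon
                  + Group(delimitedList(number)))

and_op = CaselessKeyword('AND')
or_op = CaselessKeyword('OR')
and_junction = Group(criterion + OneOrMore(and_op + criterion)) | criterion
junctions = (Group(and_junction + OneOrMore(or_op + and_junction))
             | and_junction)

# The flat list now requires at least two criteria and is tried first.
criteria = criterion + OneOrMore(tilde + criterion)
query_expr = criteria | junctions

flat = query_expr.parseString('age:less-than:40~height:greater-than:170',
                              parseAll=True).asList()
print(flat)
# [['age', 'less-than', [40]], ['height', 'greater-than', [170]]]

single = query_expr.parseString('age:less-than:40', parseAll=True).asList()
print(single)
# [['age', 'less-than', [40]]]
```

A single criterion no longer matches the flat list, so it falls through to the junctions expression.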

Lovely. We now have only one problem left: How can we override the precedence rules and form the OR junction before the AND junction in the above example? To achieve this, we introduce open (“(”) and close (“)”) parentheses as grouping operators and make our junctions expression recursive as follows:

The key element here is the forward declaration of the junctions expression which allows us to use it in the definition of a junction_element before it is fully defined.

Let’s test this by adding parentheses around the OR clause in the query string above:
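A sketch of the recursive grammar and the parenthesized test, with the same caveats as before: the keyword syntax, the attribute names and the intermediate expression names are my assumptions. The forward-declared junctions expression is used inside junction_element and only defined afterwards with the <<= operator.

```python
from pyparsing import (CaselessKeyword, Forward, Group, Literal, OneOrMore,
                       Word, alphanums, alphas, delimitedList, nums, oneOf)


def convert_number(string, location, tokens):
    return int(tokens[0])


# Illustrative criterion grammar.
colon = Literal(':').suppress()
identifier = Word(alphas, alphanums + '-')
operator = oneOf('equal-to less-than greater-than')
number = Word(nums).setParseAction(convert_number)
criterion = Group(identifier + colon + operator + colon
                  + Group(delimitedList(number)))

and_op = CaselessKeyword('AND')
or_op = CaselessKeyword('OR')
open_paren = Literal('(').suppress()
close_paren = Literal(')').suppress()

# Forward declaration: usable now, defined further down.
junctions = Forward()

# An element is a plain criterion or a parenthesized junction. The
# recursive reference comes *after* the opening parenthesis, so the
# parser always consumes input before recursing.
junction_element = criterion | (open_paren + junctions + close_paren)

and_junction = (Group(junction_element + OneOrMore(and_op + junction_element))
                | junction_element)
or_junction = (Group(and_junction + OneOrMore(or_op + and_junction))
               | and_junction)
junctions <<= or_junction

query = '(age:less-than:40 OR height:greater-than:170) AND weight:equal-to:80'
tree = junctions.parseString(query, parseAll=True).asList()[0]
print(tree[1])     # AND
print(tree[0][1])  # OR
```

With the parentheses in place, the AND junction is the top-level node and the OR junction is nested as its left-hand operand.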

Great, it works! Beware, however, of the common pitfall of introducing “left recursion” into recursive grammars: If you use a forward-declared expression at the beginning of another expression declaration, you instruct the parser to replace the former with itself until the maximum recursion depth is reached. Luckily, pyparsing provides a handy validate method which detects left recursion in grammar expressions. With the following code fragment used in place of the junction_element expression above, the grammar will no longer load but instead fail with a RecursiveGrammarException:
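A minimal sketch of such a left-recursive definition, using a deliberately tiny stand-in for the criterion expression. It assumes that the installed pyparsing version still provides validate with the behavior described here; the exception is caught only so that the script can report the outcome.

```python
from pyparsing import (CaselessKeyword, Forward, RecursiveGrammarException,
                       Word, alphas)

# Tiny stand-in for the real criterion expression.
criterion = Word(alphas)
and_op = CaselessKeyword('AND')

junctions = Forward()

# Left recursion: the forward-declared expression is the *first*
# element of an expression that is part of its own definition.
junction_element = junctions + and_op + criterion
junctions <<= junction_element | criterion

try:
    junctions.validate()
    left_recursion_detected = False
except RecursiveGrammarException:
    left_recursion_detected = True

print(left_recursion_detected)
```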

I hope this post has given you a taste of the wealth of possibilities that pyparsing offers; if you are curious, the full everest filter parsing grammar can be found here. I would recommend pyparsing to anyone trying to solve parsing problems of moderate to high complexity in Python, with the few caveats I have already mentioned.

Footnotes
  1. As Paul’s comment below suggests, he is considering removing this contentious piece of magic in the upcoming 3.x series.

2 thoughts on “Parsing query expressions in Python”

  1. Great to read your blog entry on using pyparsing to help address your everest query language. A couple of points:

    operatorPrecedence implicitly handles precedence overrides using ()’s, your extra handling with openParen and closeParen are unnecessary
    There is API documentation, generated using epydoc, to be found at https://pythonhosted.org/pyparsing/ . However, I sympathize that creating parsers is a non-trivial exercise, and more usage guides would definitely help.
    It appears you are writing this code “and_op = CaselessKeyword(AND_PAT).setParseAction(replaceWith(AND_PAT))” to ensure that you get a consistent return value when parsing “AND”, “And”, “and”, or “aNd”. CaselessKeyword already does this, returning the original initialization string regardless of what case is encountered in the input string.
    The updated version of _trimArity has certainly caused some user distress. As it was contributed by Raymond Hettinger, I was loath to just discard it. Perhaps when I come out with Pyparsing 3.0 and can break backward compatibility, I’ll redefine parseActions to require the arguments to be in (tokens, location, inputString) order (although I think I will still allow location and inputString to be optional, as I see 90% of parse actions only require the parsed tokens). I am sorry that your having to explain this was something of a buzzkill for your otherwise positive article.

    Best of luck to you in your continued development on everest!

    — Paul

    • Thanks, Paul, for your comments! I edited the blog post and followed your suggestions to improve the parsing expressions (see this changeset). And no worries about the _trimArity issue – I think it served as a good showcase for unintended side effects of a well-meaning code change aimed at improving usability.
