Inspired by my foray into OData land, I decided to extend the everest query language to support expressions with arbitrary logical operators. During the implementation of this feature, I learned a few things about parsing query expressions in Python which I thought I could share here.
Query strings in everest
consist of three main parts: The filtering part which determines the subset to select from the queried resource; the ordering part specifying the sorting order of the elements in the selected subset; and the batching part used to further partition the selected subset into manageable chunks.
So far, the filtering expressions in everest
had consisted of a flat sequence of attribute:operator:value
triples, separated by tilde (“~
“) characters. The query engine interprets tilde characters as logical AND
operations and multiple values passed in a criterion as logical OR
operations.
The filtering expression grammar was written using the excellent pyparsing package from Paul McGuire (API documentation can be found here). With pyparsing
, you can not only define individual expressions in a very readable manner, but also define arbitrary parsing actions to modify the parsing tree on the fly. Let’s illustrate this with a simplified version of the filter expression grammar that permits only integer numbers as criterion values:
|
from pyparsing import * identifier = Word(alphas, alphanums + '-').setName('identifier') nonzero_nums = '123456789' number = Word(nonzero_nums, nums) values = delimitedList(number) colon = Literal(':') criterion = Group(identifier('attribute') + colon.suppress() + identifier('operator') + colon.suppress() + values('value')) tilde = Literal('~') query = delimitedList(criterion, tilde) def parse(query_string): return query.parseString(query_string) |
Unlike a regular expression, this is largely self-explanatory. Note how we can give an expression a name using the setName
method and reuse that name in subsequent expressions and how the suppress
method can be used to omit parts of a match from the enclosing group.
The following transcript from an interactive Python session shows this grammar in action:
|
$ python -i pyparsing_ex1.py >>> result = parse('age:equal-to:34~height:less-than:180') >>> result.dump() "[['age', 'equal-to', '34'], ['height', 'less-than', '180']]" |
Not bad, but wouldn’t it be nice if the number literals were converted to integers? This is where parsing actions come in handy; to try this out, update the definition of the number
expression like this:
|
def convert_number(toks): return int(toks[0]) number = Word(nonzero_nums, nums).setParseAction(convert_number) |
Running our parsing example with the updated grammar will now produce this output:
|
"[['age', 'equal-to', 34], ['height', 'less-than', 180]]" |
As desired, the criteria values are returned as Python integers now.
As nifty as these parsing actions are, there is one thing you should know about them: Your callback code should never raise a TypeError
. Let me show you why; first, we change the convert_number
callback to raise a TypeError
like this:
|
def convert_number(toks): raise TypeError() |
Now, we parse our query string again:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
|
>>> result = parse('age:equal-to:34~height:less-than:180') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyparsing_ex2.py", line 16, in parse return query.parseString(query_string) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 1021, in parseString loc, tokens = self._parse(instring, 0) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 894, in _parseNoCache loc, tokens = self.parseImpl(instring, preloc, doActions) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 2351, in parseImpl loc, resultlist = self.exprs[0]._parse(instring, loc, doActions, callPreParse=False) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 894, in _parseNoCache loc, tokens = self.parseImpl(instring, preloc, doActions) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 2623, in parseImpl return self.expr._parse(instring, loc, doActions, callPreParse=False) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 894, in _parseNoCache loc, tokens = self.parseImpl(instring, preloc, doActions) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 2368, in parseImpl loc, exprtokens = e._parse(instring, loc, doActions) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 894, in _parseNoCache loc, tokens = self.parseImpl(instring, preloc, doActions) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 2351, in parseImpl loc, resultlist = self.exprs[0]._parse(instring, loc, doActions, callPreParse=False) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 921, in _parseNoCache tokens = fn(instring, tokensStart, retTokens) File "/Users/gathmann/virtualenv/dev_py27/lib/python2.7/site-packages/pyparsing.py", line 675, in wrapper return func(*args[limit[0]:]) TypeError: convert_number() takes exactly 1 argument (0 given) |
What happened?! The error message seems to suggest that something in your parsing expression went wrong, causing the parser not to pass anything into the parse action callback. Never would you suspect the callback itself since it looks like that has not even been called yet. What is really going on, though, is a piece of dangerous magic in the pyparsing
which allows you to define your callbacks with different numbers of arguments:
|
def _trim_arity(func, maxargs=2): limit = [0] def wrapper(*args): while 1: try: return func(*args[limit[0]:]) except TypeError: if limit[0] <= maxargs: limit[0] += 1 continue raise return wrapper |
The wrapper
is calling our callback repeatedly with fewer and fewer arguments, expecting a TypeError
if the number of arguments passed does not match func
s signature. If your callback itself raises a TypeError
, the wrapper
gives up and re-raises the last exception – which then produces the misleading traceback shown above. I think this is an example where magic introduced for convenience is actually causing more harm than good [1].
Let us now turn to the main topic of this post, adding support for arbitrary query expressions with explicit AND
and OR
operators. We start by adding a few more elements to our grammar:
|
AND_PAT = 'and' OR_PAT = 'or' and_op = CaselessKeyword(AND_PAT).setParseAction(replaceWith(AND_PAT)) or_op = CaselessKeyword(OR_PAT).setParseAction(replaceWith(OR_PAT)) logical_op = and_op('operator') | or_op('operator') |
The new AND
and OR
logical operators replace the tilde operator from the original grammar. Giving the AND
operator precedence over the OR
operator and calling the composition of two criteria a “junction”, we can now extend our query expression as follows:
|
BINARY = 2 junctions = operatorPrecedence( criterion, [(and_op, BINARY, opAssoc.LEFT), (or_op, BINARY, opAssoc.LEFT), ]) query = delimitedList(criterion, tilde) | junctions def parse(query_string): return query.parseString(query_string) |
Testing this with a slightly fancier query string:
|
>>> result = parse('age:equal-to:34 OR height:less-than:180 AND weight:greater-than:70') >>> print result.dump() "[[['age', 'equal-to', 34], 'or', [['height', 'less-than', 180], 'and', ['weight', 'greater-than', 70]]]]" |
Note that the result indeed reflects the operator precedence rules. Let us briefly check if we can still match our simple tilde-separated criteria query strings:
|
>>> result = parse('age:equal-to:34~height:less-than:180') >>> result.dump() "[['age', 'equal-to', 34]]" |
Oops – that did not go so well. The problem is that both the junctions
and the delimitedList
expression match a single criterion, so the parser stops after extracting the first of the two criteria. We can solve this issue easily by putting the simple criteria expression first and changing it such that it requires at least two criteria:
|
simple_criteria = criterion + OneOrMore(tilde.suppress() + criterion) query = simple_criteria | junctions |
Checking:
|
>>> result = parse('age:equal-to:34~height:less-than:180') >>> result.dump() "[['age', 'equal-to', 34], ['height', 'less-than', 180]]" |
Lovely. By now, we only have one problem left: How can we override the precedence rules and form the OR
junction before the AND
junction in the above example? To achieve this, we introduce open (“(
“) and close (“)
“) parentheses as grouping operators and make our junctions
expression recursive as follows:
|
open_paren = Literal('(') close_paren = Literal(')') junctions = Forward() junction_element = (criterion | open_paren.suppress() + junctions + close_paren.suppress()) junctions << operatorPrecedence( junction_element, [(and_op, BINARY, opAssoc.LEFT), (or_op, BINARY, opAssoc.LEFT), ]) simple_criteria = criterion + OneOrMore(tilde.suppress() + criterion) query = simple_criteria | junctions |
The key element here is the forward declaration of the junctions
expression which allows us to use it in the definition of a junction_element
before it is fully defined.
Let’s test this by adding parentheses around the OR
clause in the query string above:
|
>>> result = parse('(age:equal-to:34 OR height:less-than:180) AND weight:greater-than:70') >>> print result.dump() "[[[['age', 'equal-to', 34], 'or', ['height', 'less-than', 180]], 'and', ['weight', 'greater-than', 70]]]" |
Great, it works! Beware, however, the common pitfall of introducing “left recursion” into recursive grammars: If you use a forward-declared expression at the beginning of another expression declaration, you instruct the parser to replace the former with itself until the maximum recursion depth is reached. Luckily, pyparsing
provides a handy validate
method which detects left-recursion in grammar expressions. With the following code fragment used in place of the junction_element
expression above, the grammar will no longer load but instead fail with a RecursiveGrammarException
:
|
junction_element = (criterion | junctions | open_paren.suppress() + junctions + close_paren.suppress()) junction_elemen.validate() |
I hope this post has given you a taste of the wealth of possibilities that pyparsing
offers; if you are curious, the full everest
filter parsing grammar can be found here. I would recommend pyparsing
to anyone trying to solve parsing problems of moderate to high complexity in Python, with the few caveats I have already mentioned.