2018-12-08

Beyond Regular Expressions: Recursive Descent Parsing

Your eyes are bleeding, and trying to reliably extract information from C header files using regular expressions has driven you to the brink of insanity. You can admit that you’ve got a problem, but now what?

[This post was derived from a Jupyter notebook.]

1 Introduction

The last post discussed how regular expressions work, from a theoretical standpoint, and showcased the core of a working regex engine based directly on that theory. It also ended with a discussion of the limits of regular expressions and why no regex engine can parse regular expressions–a capability we’d need to create a working front end that can accept regexen as a first step to doing interesting things with them (like synthesizing hardware or screen printing state machines on tie-dyed T-shirts). This post explores some options for when regular expressions are simply not enough.

There’s a lot of ground to cover in this post, but in brief, we will:

  • Reason our way into a new computational model, analyze how it works, and base a parsing strategy on that analysis;
  • Create a collection of tools for constructing recursive descent parsers;
  • Briefly grapple with several issues that crop up in grammar and parser design;
  • Use a parser for a simple expression language to build a working four-function calculator program, with variables, parenthetical grouping, and almost civilized error detection and recovery.

2 Balanced Parentheses

Before implementing a predicate to check whether a string consists of balanced parentheses, let’s consider how to describe such strings. The simplest balanced-paren string is just an empty string–with neither left nor right parentheses, nor anything else, it’s obviously balanced. The next simplest is a single pair '()'. From there, we can do some combination of

  • inserting a balanced-paren string between the left and right parentheses and
  • appending a balanced-paren string to the existing expression.

We know that regular expressions won’t do, since they can’t handle recursive structures or algebraic relationships, so we’ll head straight for a grammar, specifically, a context-free grammar (CFG), like so:

p → L p R p
p → ε

where L and R stand in for left and right parentheses, respectively.

A CFG is made up of a series of rules (also called productions or substitutions), each describing how the single symbol to the left of the arrow (the nonterminal symbol) relates to the symbols on the right (a sequence of nonterminals and terminals). In this particular language, L and R are the terminal symbols, so named because they are not described by any rule in the grammar; ε isn’t really a symbol at all, but a stand-in for the empty string. p, meanwhile, is described by a rule, and thus is a nonterminal symbol. (If we had more than one symbol left of an arrow, we’d consider the additional symbols to be restrictions on when the production could be applied–the production, and thus the grammar, would be context-sensitive. We’re not looking at such languages today.)

For brevity, rules for transforming the same nonterminal are usually collapsed, with alternatives separated by pipe characters, as in:

p → L p R p | ε

For ease of handling in a text editor, we’ll use a stripped-down version of the Backus-Naur Form (BNF) for our grammar notation. Converting to BNF and using more descriptive naming than is typical in mathematical presentations of grammars:

parens :
       | '(' parens ')' parens

Note that the ε-production became the empty production in BNF, placed at the beginning for clarity. (In a CFG, there is no significance attached to the ordering of productions—they can theoretically be tried or applied in any order.)

3 Brief Diversion: Regular Grammars

If we were to restrict a CFG such that each production could contain at most one nonterminal, and that symbol could appear only at the right end of the production, we would end up with something called a right-regular grammar (likewise, restricting it to appear at the left end would lead to a left-regular grammar). Such regular grammars produce regular languages, the very languages describable by regular expressions and recognizable by finite state machines. For example, to describe a run of any number of digits in a regex, we might write,

/[0-9]*/

We could also write a grammar:

digit  : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
digits :
       | digit digits

We could also write digits to produce a left-regular grammar:

digits :
       | digits digit

While it’s by now clear that regular languages are a subset of context-free languages, it should also be clear that regular expressions are a far more compact notation for specifying them. That said, it’s worth keeping this connection in mind as a tool for building up more complex regular expressions.
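To make the connection concrete, here is a minimal sketch (ours, not from any library) showing that the digits grammar needs no stack at all; a plain loop, i.e., a finite state machine, suffices:

def is_digits(s):
    'recognize the digits language with no stack: a finite state machine'
    for c in s:
        if not ('0' <= c <= '9'): return False
    return True   # the empty production means '' is accepted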

4 Back to Balanced Parentheses

Knowing how to describe balanced paren strings, how can we test whether a given string is one? The very simplest thing we could do is to maintain a counter, initialized at zero, that we increment or decrement on encountering an open or close parenthesis, respectively:

def is_balanced_paren(s):
    opened = 0
    for c in s:
        if c == '(': opened += 1
        elif c == ')': opened -= 1
        else: return False      # illegal character
        if opened < 0: return False  # closed too many
    return opened == 0

Which performs as expected:

def test(fn, *battery):
    width = max(len(case) for case, _ in battery) + 2
    fmt = '%%-%ds gives %%-8s --> %%s' % width

    for (case, expected) in battery:
        result = fn(case)
        success = 'pass' if result == expected else 'fail'
        case_str = "'%s'" % case
        print(fmt % (case_str, result, success))

PAREN_TESTS = [
    ('()()((())())', True),
    ('()()((())()',  False),
    (')(',           False),
    ('',             True),  
]

test(is_balanced_paren, *PAREN_TESTS)
'()()((())())' gives True     --> pass
'()()((())()'  gives False    --> pass
')('           gives False    --> pass
''             gives True     --> pass

With our toy example working, let’s motivate something slightly more complex: We’ll allow parentheses ('()'), brackets ('[]'), and braces ('{}') in the same string—all balanced. The intuition for the grammar is the same, but now it looks like:

parens :
       | '(' parens ')' parens
       | '[' parens ']' parens
       | '{' parens '}' parens

Suddenly a simple counter, or set of counters, won’t cut it—the closing braces, brackets, and parentheses must also appear in the proper order. What’s called for is a stack:

def is_balanced_parens(s):
    # using a sentinel for end of input lets us avoid explicit length checks
    end = '$'

    stack = [end]
    s2 = list(s) + [end]

    right = dict(('()', '[]', '{}'))  # maps each opener to its closer
    for c in s2:
        if c == stack[-1] == end: return True
        elif c == stack[-1]: stack.pop()
        elif c in right: stack.append(right[c])
        else: return False  # mismatched closer or illegal character

Adding some mixed-enclosure cases and testing again, we see that everything works properly:

MIXED_ENCLOSURE_TESTS = [
    ('[({}()[[{}]()])(((())))]', True),
    ('[({}()[[{}]()])(((())))',  False)
]

test(is_balanced_parens, *MIXED_ENCLOSURE_TESTS, *PAREN_TESTS)
'[({}()[[{}]()])(((())))]' gives True     --> pass
'[({}()[[{}]()])(((())))'  gives False    --> pass
'()()((())())'             gives True     --> pass
'()()((())()'              gives False    --> pass
')('                       gives False    --> pass
''                         gives True     --> pass

5 Pushdown Automata

Now that we have an approach that’s more generally applicable, let’s examine how it works:

  • As in the case of our first implementation of an IP address recognizer, there’s a tiny bit of state—specifically, there are three states in which the checking process can be:
    • A working state (from invocation right up to return);
    • A failure state (whenever we return False); and,
    • An accepting state (whenever we return True).
  • The next state is determined by:
    • The current state;
    • The current input symbol; and,
    • The symbol atop the stack.
  • As we proceed, we can manipulate the stack based on the selected transition.

By bolting a stack onto a state machine, thereby solving the problem of limited memory that constrains normal finite state machines, we’ve invented a more powerful device, called a Push-Down Automaton (PDA), that can recognize any context-free language—like HTML and regular expression syntax. In more formal treatments, PDAs are specified by a tuple,

(Q, Σ, Γ, δ, q0, F)

where

  • Q is a set of states;
  • Σ is the input alphabet (i.e., the set of legal input symbols);
  • Γ is the stack alphabet (i.e., the set of symbols that can be on the stack);
  • δ: Q × Σε × Γε → ℘(Q × Γε) is the transition function;
  • q0 ∈ Q is the initial state; and,
  • F ⊆ Q is the set of accepting states.

In our PDA, we know that we have three states (call them “scan”, “accept”, and “reject”) and we know their respective roles. We know the input alphabet (including the '$' that we append to the input string) and we know the stack alphabet (the closing markers and '$'). The only thing left is to specify the transition function. In abbreviated form (omitting transitions to the reject state):

state | stack | input | stack op | next state
------|-------|-------|----------|-----------
scan  | any   | '('   | push ')' | scan
scan  | any   | '{'   | push '}' | scan
scan  | any   | '['   | push ']' | scan
scan  | ')'   | ')'   | pop      | scan
scan  | '}'   | '}'   | pop      | scan
scan  | ']'   | ']'   | pop      | scan
scan  | $     | $     | pop      | accept

This transition function is the program that enables our PDA to recognize the language of interest over the input alphabet. The question that immediately arises when we consider more complex languages is: How can we take a grammar describing a language and come up with a PDA that recognizes it?

6 Top-Down Parsing

There are two main approaches to programming a PDA, and therefore to parsing context-free languages: top-down and bottom-up. We’ll concern ourselves with top-down for now.

For simplicity, let’s go back to the original balanced paren grammar:

parens : 
       | '(' parens ')' parens

We have two productions, and we’ll restrict ourselves to our favorite three states. Instead of only pushing complementary symbols onto the stack, though, let’s try something different:

  1. We’ll prime the stack with our single nonterminal.
  2. Whenever there’s a terminal atop the stack, we’ll match it in the input stream, removing it from the stack as we do.
  3. Whenever there is a nonterminal atop the stack, we’ll replace it with the symbols from the production we expect to use. Because we’re ultimately going for replacement with terminals, and since those are matched only when atop the stack, we have to make sure they appear on the top in the proper order. So, we’ll push them on from right to left (i.e., opposite their order in the production).

Maybe a transition function will make things clearer:

state | stack  | input | stack operations                   | next state | consume input?
------|--------|-------|------------------------------------|------------|---------------
scan  | parens | any   | pop; push parens, ')', parens, '(' | scan       | no
scan  | parens | any   | pop                                | scan       | no
scan  | '('    | '('   | pop                                | scan       | yes
scan  | ')'    | ')'   | pop                                | scan       | yes
scan  | $      | $     | pop                                | accept     | yes

Consider the input string '(())()'. Based on the above, the stack usage and input consumption look like (stack bottom at the left):

operation  | stack                              | unconsumed | notes
-----------|------------------------------------|------------|------
initialize | $ parens                           | '(())()' $ |
expand     | $ parens ')' parens '('            | '(())()' $ |
match      | $ parens ')' parens                | '())()' $  |
expand     | $ parens ')' parens ')' parens '(' | '())()' $  |
match      | $ parens ')' parens ')' parens     | '))()' $   |
expand     | $ parens ')' parens ')'            | '))()' $   | ε
match      | $ parens ')' parens                | ')()' $    |
expand     | $ parens ')'                       | ')()' $    |
match      | $ parens                           | '()' $     |
expand     | $ parens ')' parens '('            | '()' $     |
match      | $ parens ')' parens                | ')' $      |
expand     | $ parens ')'                       | ')' $      |
match      | $ parens                           | $          |
expand     | $                                  | $          |
match      |                                    |            | accept

Whenever a nonterminal, call it x, appears atop the stack, we replace it with a corresponding series of symbols we expect to match in the input; any terminal appearing atop the stack is immediately matched against the input. Once we’re done matching all of the symbols that replaced x, we’re by definition done matching x itself, and we can go on matching the next thing (call it y). Another way to think about this is that a symbol on the stack is actually a directive to match one of the corresponding productions in the input, and the expansion operation is the first part of how that happens for nonterminals. It’s hardly a leap for us to equate the stack expansion of a nonterminal with calling a procedure to match that expansion from the input—in fact, that’s the very basis of recursive descent parsing, which is the primary method of writing parsers for context free languages by hand.
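To make this mechanism concrete, here is a minimal sketch of the PDA program above in code. One assumption worth flagging: we peek at the next input symbol to choose between the two parens expansions (expand when it is '(', take the ε-production otherwise), which is exactly the choice made in the trace above.

def pda_parens(s):
    tape = list(s) + ['$']    # input with end marker
    stack = ['$', 'parens']   # primed with the start symbol
    i = 0
    while stack:
        top = stack.pop()
        if top == 'parens':
            # expand: push the production's symbols right to left,
            # or push nothing at all for the epsilon production
            if i < len(tape) and tape[i] == '(':
                stack += ['parens', ')', 'parens', '(']
        elif i < len(tape) and top == tape[i]:
            i += 1               # match a terminal (or the end marker)
        else:
            return False         # reject
    return i == len(tape)

assert pda_parens('(())()') and not pda_parens(')(')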

7 Alternative Views

We’re using CFGs to describe strings that are produced by external processes, like human programmers or very chatty sensors. A different view is that one can take a CFG and, beginning with the start symbol, repeatedly apply the various productions to generate different strings in the corresponding context free language. Continuing with the above input and at each step choosing the leftmost nonterminal to expand, the derivation (with ε’s omitted) would look like:

p ⇒ L p R p
  ⇒ L L p R p R p
  ⇒ L L R p R p
  ⇒ L L R R p
  ⇒ L L R R L p R p
  ⇒ L L R R L R p
  ⇒ L L R R L R   (i.e., '(())()')

This leftmost derivation corresponds closely with the actions taken by the PDA we just used to recognize the input string as a member of the balanced-paren language.
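The generative view is easy to demonstrate. In this tiny sketch (the dict encoding of the grammar is our own convention, not anything standard), we pick a production at random for each nonterminal and recurse:

import random

GRAMMAR = {'p': [['(', 'p', ')', 'p'], []]}   # p : '(' p ')' p | (empty)

def generate(sym='p'):
    if sym not in GRAMMAR: return sym              # terminals emit themselves
    production = random.choice(GRAMMAR[sym])       # nonterminals expand
    return ''.join(generate(x) for x in production)

generate()   # e.g. '(())()', some random member of the language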

A different and more relevant (for us) view is that a CFG provides a means of understanding the structure of utterances in the corresponding language. More than merely recognizing when a string belongs in, for example, parens, we want to decompose such a string into its parts—i.e., we wish to parse the string. In the case of our input, ’(())()’, the parse tree would look like:

[figure: parse tree for '(())()' under the balanced-paren grammar]

Each leaf node in the tree is an instance of a terminal symbol from the CFG, and each interior node is an instance of a nonterminal. Generating this structure, whether explicitly or only implicitly, is the focus of parsing, also called syntactic analysis. Interpreting a parse tree to, for example, generate compiled programs or carry out diabolical doomsday scenarios, is the process of semantic analysis. As we’ll soon see,

  • There is a close correspondence between the structure of a parse tree and the ease of certain kinds of semantic analysis; and,
  • It’s possible to engineer a CFG for a given language to produce parse trees that are more convenient for the processing that we intend to do.

8 Descending, Recursively

In recursive descent parsing, we represent every symbol by a function capable of matching it in the input. Using this approach, our implementation of is_balanced_parens becomes:

# We'll adopt the convention that a parsing function will return, on success,
# a pair (parsed, rest), where
# * parsed is a representation of what was matched from the input, and
# * rest is the remaining input

## terminal parsers

def LPAREN(s):
    'match open paren'
    if s and s[0] == '(': return s[0], s[1:]

def RPAREN(s):
    'match close paren'
    if s and s[0] == ')': return s[0], s[1:]

def EPSILON(s): 
    'match epsilon'
    return '', s

## nonterminals

def parens(s):
    '''
    parens : LPAREN parens RPAREN parens
           | EPSILON
    '''

    # Using exception handling for dealing with alternatives keeps our code
    # from marching to the right. Also, note the arrangement: we're starting
    # with the production that will actually try to consume input. This matters
    # because we return the first production that succeeds, and EPSILON never
    # fails.

    try:
        # this chaining on rest accomplishes sequential matches
        left,      rest = LPAREN(s)
        p_inside,  rest = parens(rest)
        right,     rest = RPAREN(rest)
        p_outside, rest = parens(rest)
        return (left, p_inside, right, p_outside), rest
    except TypeError:
        return EPSILON(s)

def is_balanced_parens(s):
    # Parsing a valid string will consume it entirely.
    parsed, rest = parens(s)
    return rest == ''

And the new is_balanced_parens responds as expected:

test(is_balanced_parens, *PAREN_TESTS)
'()()((())())' gives True     --> pass
'()()((())()'  gives False    --> pass
')('           gives False    --> pass
''             gives True     --> pass
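For a peek at the (parsed, rest) convention in action, calling parens directly on '(())()' yields a nested tuple that mirrors the derivation, along with an empty remainder:

parens('(())()')
(('(', ('(', '', ')', ''), ')', ('(', '', ')', '')), '')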

9 Basic Expression Parsing

Let’s try something more ambitious: parsing simple mathematical expressions. The terminals will be:

  • OPERATOR: '+', '-', '*', and '/';
  • NUMBER; and,
  • Open and close parentheses (LPAREN and RPAREN, respectively).

The simplest thing we can try is to say something like:

expression : expression OPERATOR expression
           | LPAREN expression RPAREN
           | NUMBER

Simple, right?

Before translating this to code, let’s take a few minutes to implement better tools for constructing recursive descent parsers. The exception-based implementation of parens above is horrible. What we really want is to write code that looks more like the grammar. Without going all the way to producing callable objects composable under various operations:

import re

## utilities for making terminal parsers

def match(spec, s):
    'match a regular expression spec in s, skipping leading whitespace'
    s = s.lstrip()
    x = re.match(spec, s)
    if x: return s[:x.end()], s[x.end():]

def literal(spec, s):
    'match a literal string spec in s, skipping leading whitespace'
    s = s.lstrip()
    n = len(spec)
    if s[:n] == spec: return spec, s[n:]

## utilities for making nonterminal parsers

def seq(*syms):
    'return a parser that matches sequences of symbols'
    def parse(s):
        acc = []
        rest = s
        for sym in syms:
            x = sym(rest)
            if not x: return False
            matched, rest = x
            acc.append(matched)
        return acc, rest
    return parse

def alt(*syms):
    'return a parser that matches alternatives (first match wins)'
    def parse(s):
        for sym in syms:
            x = sym(s)
            if x: return x
        return False
    return parse

def parse(start, s):
    'match only if all input is consumed'
    x = start(s)
    if x:
        matched, rest = x
        if rest.strip() == '': return matched
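As a quick sanity check of these combinators (using throwaway lambdas rather than grammar symbols from the post):

A = lambda s: literal('a', s)
B = lambda s: literal('b', s)
ab = seq(A, alt(B, A))

ab('a b c')   # gives (['a', 'b'], ' c')
ab('b a')     # gives False; the sequence must start with 'a'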

Also, let’s create a few utilities to simplify handling the parse trees that we’ll create:

# Time spent on a nice printable representation of complex data structures 
# repays itself at debugging time.

def indent(s, tab='    '): return tab + s.replace('\n', '\n' + tab)

class symbol:
    def __init__(self, type_, value, terminal=False):
        self.type = type_
        self.value = value
        self.terminal = terminal

        if not self.terminal and type(self.value) != list: 
            self.value = [self.value]

    def __repr__(self):
        if self.terminal: return f"{self.type}: '{self.value}'"
        else:
            header = self.type + ':'
            body = '\n'.join(map(repr, self.value))

            if body.count('\n') == 0: return header + ' ' + body
            else: return header + '\n' + indent(body)

    def __iter__(self):
        if not self.terminal: return iter(self.value)
        else: return iter(())

class parse_result(tuple):
    def __repr__(self):
        return '''
        parse
        =====
        %s

        unconsumed
        ==========
        %s
        '''.replace('''
        ''', '\n') % self

class parser:
    def __init__(self, f, terminal=None):
        self.f = f
        if terminal is None:
            # as a convenience, infer terminality from f's name being all caps
            self.terminal = f.__name__ == f.__name__.upper()
        else:
            self.terminal = terminal

    def __call__(self, s):
        x = self.f(s)

        if x: 
            matched, rest = x
            sym = symbol(self.f.__name__, matched, self.terminal)
            return parse_result((sym, rest))

Applying these to the balanced paren language, we get:

## Terminals

@parser
def LPAREN(s):
    'match open paren'
    if s and s[0] == '(': return s[0], s[1:]

@parser
def RPAREN(s):
    'match close paren'
    if s and s[0] == ')': return s[0], s[1:]

@parser
def EPSILON(s): 
    'match epsilon'
    return '', s

## Nonterminals

@parser
def parens(s):
    '''
    parens : LPAREN parens RPAREN parens
           | EPSILON
    '''
    return alt(seq(LPAREN, parens, RPAREN, parens), 
               EPSILON)(s)

Running parens on the string '(())()' results in the parse tree we saw before:

parens('(())()')

[notebook output: the parse tree for '(())()' we saw before, with nothing unconsumed]

Now we’re ready to translate our simple expression grammar into code:

@parser
def NUMBER(s): return match(r'\d+', s)

@parser
def OPERATOR(s): return match('[+*/-]', s)

@parser
def expression(s):
    '''
    expression : expression OPERATOR expression
               | LPAREN expression RPAREN
               | NUMBER
    '''
    return alt(seq(expression, OPERATOR, expression),
               seq(LPAREN, expression, RPAREN),
               NUMBER)(s)

Testing it on a simple addition reveals a problem, though:

expression('4 + 5 + 6')
---------------------------------------------------------------------------

RecursionError                            Traceback (most recent call last)

<ipython-input-18-a70ee2db7313> in <module>()
----> 1 expression('4 + 5 + 6')


[...]


RecursionError: maximum recursion depth exceeded

What’s happened is that the very first production tested,

expression : expression OPERATOR expression

led to a call to expression without consuming any input; that led to another call, and so on, until we destroyed our call stack. This left-recursive production in the grammar led to infinite recursion in the parser, and there’s no mere trick of implementation that will get rid of it. We have to revisit the grammar itself.

Suppose we came up with the concept of a subexpression, or subex, which would represent anything that could be combined with another subex using an OPERATOR. We could then say that an expression is nothing more than a series of OPERATOR-separated subexpressions, where a subex is just a NUMBER or an expression enclosed in parentheses:

expression : subex OPERATOR expression
           | subex

subex : NUMBER
      | LPAREN expression RPAREN

This kind of transformation is sufficient to break the left recursion. While it’s not strictly necessary, further factoring can often make life easier later. For example, there are two types of subex:

enclosed   : LPAREN expression RPAREN

subex : NUMBER | enclosed

Now to revisit the implementation:

@parser
def expression(s):
    '''
    expression : subex OPERATOR expression
               | subex
    '''
    return alt(seq(subex, OPERATOR, expression), 
               subex)(s)

@parser
def subex(s): 
    'subex : NUMBER | enclosed'
    return alt(NUMBER, enclosed)(s)

@parser
def enclosed(s): 
    'enclosed : LPAREN expression RPAREN'
    return seq(LPAREN, expression, RPAREN)(s)

Calling expression('4 + 5*6 - 7') generates the following parse tree:

[figure: parse tree for '4 + 5*6 - 7' under the subex grammar]

Okay, we’ve fixed the left recursion problem only to highlight a new one: We know that

4 + 5*6 - 7 = 27

In general, parse trees are interpreted from the bottom up, with the tightest constructs evaluated first (i.e., by postorder traversal). Applying this to the parse tree we just generated, we see that it really describes

4 + (5 * (6 - 7)) = -1

which is wrong.

The issue here is that our grammar says nothing about operator precedence. We can fix this by elaborating the notion of an expression further: We’ll now treat it as a series of terms combined under addition and subtraction. Each term will then consist of one or more factors, multiplied or divided:

expression : term ADDOP expression
           | term

term : factor MULOP term
     | factor

factor : NUMBER | enclosed

where ADDOP represents the additive operators (+ and -) and MULOP represents the multiplicative operators (* and /).

Adapting our parser:

@parser
def ADDOP(s): return match('[+-]', s)

@parser
def MULOP(s): return match('[*/]', s)

@parser
def expression(s):
    '''
    expression : term ADDOP expression
               | term
    '''
    return alt(seq(term, ADDOP, expression), 
               term)(s)

@parser
def term(s):
    '''
    term : factor MULOP term
         | factor
    '''
    return alt(seq(factor, MULOP, term), 
               factor)(s)

@parser
def factor(s): 
    'factor : NUMBER | enclosed'
    return alt(NUMBER, enclosed)(s)

And now expression correctly parses the test string:

expression('4 + 5*6 - 7')

[figure: parse tree for '4 + 5*6 - 7', with 5*6 grouped under its own term]

So far, so good, but another problem lurks: Associativity. The parse tree that expression generates for '1 - 2 - 3' is:

expression('1-2-3')

[figure: parse tree for '1-2-3', grouped as 1 - (2 - 3)]

This parse tree says that 1-2-3 = 1-(2-3) = 2, which we know to be incorrect. The problem is that subtraction is left-associative (as is division), while our grammar makes all operators right-associative. We can fix this by changing the grammar yet again, arriving at something closer to the second expression grammar that we had (when we cured the left recursion).

Think of an expression as a starting term followed by a series of increments:

expression : term increments

increments :
           | increment increments

Each increment consists of an operation (either addition or subtraction) and a term:

increment : ADDOP term

Likewise, each term is composed of a starting factor followed by a series of scalings, each composed of an operation (multiplication or division) and a factor:

term : factor scalings

scalings :
         | scaling scalings

scaling : MULOP factor

With this,

@parser
def expression(s):
    'expression : term increments'
    return seq(term, increments)(s)

@parser
def increments(s): 
    '''
    increments : 
               | increment increments
    '''
    return alt(seq(increment, increments), 
               EPSILON)(s)

@parser
def increment(s): 
    'increment : ADDOP term'
    return seq(ADDOP, term)(s)

@parser
def term(s): 
    'term : factor scalings'
    return seq(factor, scalings)(s)

@parser
def scalings(s): 
    '''
    scalings :
             | scaling scalings
    '''
    return alt(seq(scaling, scalings), 
               EPSILON)(s)

@parser
def scaling(s): 
    'scaling : MULOP factor'
    return seq(MULOP, factor)(s)

@parser
def factor(s): 
    'factor : NUMBER | enclosed'
    return alt(NUMBER, enclosed)(s)

@parser
def enclosed(s): 
    'enclosed : LPAREN expression RPAREN'
    return seq(LPAREN, expression, RPAREN)(s)

We can now build a parse tree for 1-2-3 that accurately describes what we need to do to correctly evaluate it:

expression('1 - 2 - 3')

[figure: parse tree for '1 - 2 - 3' as a starting term followed by two increments]

We could even process this tree to evaluate the expression. In fact, let’s do that.

There are a couple of perfectly good options for evaluating expressions based on our parser:

  • We could process the tree after it’s finished, explicitly performing a post-order traversal and propagating intermediate results upward (see the sketch after this list);
  • We could outfit the parsers with evaluation logic triggered immediately on matching the associated symbols.
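The first option might look something like the sketch below, where combine stands for a hypothetical dispatch on node type that would hold the same per-symbol logic we’re about to write as handlers:

def eval_tree(sym):
    'evaluate a finished parse tree by explicit postorder traversal'
    if sym.terminal:
        return int(sym.value) if sym.type == 'NUMBER' else sym.value
    results = [eval_tree(child) for child in sym.value]   # children first
    return combine(sym.type, results)   # hypothetical per-nonterminal rules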

The latter option is referred to as syntax-directed translation, since the parser drives the entire translation process (think of evaluation as the process of translating an expression to a final value). Since syntax-directed translation is more common in compiler design, and because it’s an elegant approach, we’ll go that way. Let’s start by adding to parser:

class parser(parser):   # extend the existing parser class in place (a notebook idiom)
    def on(self, handler): 
        'set a symbol handler'
        self.handler = handler

    def __call__(self, s):
        x = self.f(s)
        if x: 
            matched, rest = x
            sym = symbol(self.f.__name__, matched, self.terminal)

            # call the symbol handler if present
            if hasattr(self, 'handler'): self.handler(sym)

            return parse_result((sym, rest))

Now we can write handlers for each symbol we care to process. Tagging each symbol with a result attribute accumulated up the parse tree:

from functools import reduce
from operator import mul

def product(xs): return reduce(mul, xs, 1)

@NUMBER.on
def h_NUMBER(sym): sym.result = int(sym.value)    

@enclosed.on
def h_enclosed(sym): sym.result = sym.value[1].result

@factor.on
def h_factor(sym): sym.result = sym.value[0].result

@scaling.on
def h_scaling(sym):
    op, mag_ = sym.value
    mag = mag_.result
    sym.result = mag if op.value == '*' else 1/mag

@scalings.on
def h_scalings(sym):
    if sym.value[0].type == 'EPSILON': sym.result = 1
    else:
        sym.result = sym.value[0].result * sym.value[1].result

@increment.on
def h_increment(sym):
    op, mag_ = sym.value
    mag = mag_.result
    sym.result = mag if op.value == '+' else -mag

@increments.on
def h_increments(sym):
    if sym.value[0].type == 'EPSILON': sym.result = 0
    else:
        sym.result = sym.value[0].result + sym.value[1].result

@term.on
def h_term(sym):
    a, b = sym.value
    sym.result = a.result * b.result

@expression.on
def h_expression(sym):
    a, b = sym.value
    sym.result = a.result + b.result

A few test cases, including the troubling expressions from before, give us confidence that we’re properly evaluating positive integer arithmetic:

def calc(s): return expression(s)[0].result

test(calc,
     ('4 + 5*6 - 7', 4 + 5*6 - 7),
     ('1 - 2 - 3',   1 - 2 - 3),
     ('(30 + 40)/(3 + 4)',  (30 + 40)/(3 + 4))
    )
'4 + 5*6 - 7'       gives 27       --> pass
'1 - 2 - 3'         gives -4       --> pass
'(30 + 40)/(3 + 4)' gives 10.0     --> pass

10 Parsing BNF Grammars

Now that we have a straightforward way to produce recursive descent parsers from BNF grammars by inspection, let’s try a useful party trick: parsing the very BNF notation we’ve been using.

As we’ve seen, a grammar is a series of rules (we’ll insist on having at least one):

rules : rule rules
      | rule

Each rule consists of a name and a series of alternative productions, with appropriate delimiters:

rule : IDENTIFIER COLON productions

productions : production PIPE productions
            | production

And, finally, a production is nothing more than a series of symbols to substitute for the nonterminal being defined:

production :
           | IDENTIFIER production

Given these definitions, it should be trivial to write a parser by inspection with the tools we have, yes? Let’s try.

## terminals

@parser
def IDENTIFIER(s): return match(r'[a-zA-Z_][\w_]*', s)

@parser
def COLON(s): return literal(':', s)

@parser
def PIPE(s): return literal('|', s)

## nonterminals

@parser
def rules(s):
    '''
    rules : rule rules
          | rule
    '''
    return alt(seq(rule, rules), rule)(s)

@parser
def rule(s): 
    'rule : IDENTIFIER COLON productions'
    return seq(IDENTIFIER, COLON, productions)(s)

@parser
def productions(s):
    '''
    productions : production PIPE productions
                | production
    '''
    return alt(seq(production, PIPE, productions), 
               production)(s)

@parser
def production(s):
    '''
    production : 
               | IDENTIFIER production               
    '''
    return alt(seq(IDENTIFIER, production),
               EPSILON)(s)

When we run this on a simple grammar, though, we encounter a problem:

rules('''
a : b c d
d : e f
''')

[notebook output: only the first rule is parsed, with the second rule’s name 'd' absorbed into the first rule’s production; unconsumed input begins at ': e f']

We should have gotten a parse tree over the entire grammar; instead, the parser stopped early. The unconsumed input begins with a colon—it seems that somewhere in the parsing process we’re grabbing the IDENTIFIER that begins a rule definition (in this case 'd') and appending it to the previous production. In fact, this very statement suggests a hypothesis we can test: that production is too greedy:

production('''
d
d: e f
''')

[notebook output: production greedily consumes both d’s; unconsumed input begins at ': e f']

The problem is that there are two roles that IDENTIFIERs can play—to start rule definitions and to reference rules defined elsewhere—and our BNF grammar for BNF grammars allows the interpretation of an IDENTIFIER to shift from one position to the next without any intervening marker, like an explicit delimiter indicating the end of a rule. We have some options for fixing this:

  • We can add such a delimiter, e.g., a semicolon.
  • We can separate out the terminal parsers and run them over the input stream before the nonterminal parsers, thereby transforming the input stream into a stream of tokens (this is called lexical analysis, or lexing). That done, we can add to the grammar a new terminal, say, RULE_NAME, to absorb the colon; this terminal would only be used for introducing a rule definition.
  • We can elaborate the production parser to check whether a colon follows an IDENTIFIER that it’s about to consume and, if so, not consume it. This strategy is called lookahead.
  • Finally, we can elaborate the entire parsing process to incorporate a mechanism for generating all possible parses that consume any amount of the input stream. This strategy is called backtracking.

Adding a rule-end delimiter is an easy fix, and it’s a good one—it transforms the grammar into something that can be parsed very efficiently with the tools we already have. However, the task before us is to parse the BNF notation that we’re using. The approach of using a separate lexer and a grammar to take advantage of it would enable us to meet our objective, but it introduces more complexity than is actually required to do this job.
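For reference, the delimiter fix would have been a one-symbol change to the rule production (plus a trivial SEMICOLON terminal), at the cost of changing the very notation we set out to parse:

rule : IDENTIFIER COLON productions SEMICOLON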

Having rejected the first two options, let’s consider the next.

11 Lookahead

The basic problem, as discussed, is that production is grabbing IDENTIFIERs without considering whether they might be introducing a new rule, something that could be determined by looking ahead one more token to see if a colon is there. It is very easy to modify the existing code to use this approach. First, let’s define a couple of functions to perform arbitrary lookahead without actually consuming input:

def follow(sym):
    'succeed, consuming no input, only if sym matches here'
    def ret(s):
        if sym(s): return EPSILON(s)
    return ret

def nofollow(sym):
    'succeed, consuming no input, only if sym does not match here'
    def ret(s):
        if not sym(s): return EPSILON(s)
    return ret

Then, we just have to change production to check that a colon doesn’t immediately follow:

@parser
def production(s):
    '''
    production :
               | IDENTIFIER production
    '''
    return alt(seq(IDENTIFIER, production, nofollow(COLON)),
               EPSILON)(s)

Keep in mind that adding the lookahead parser to the first production introduces an additional EPSILON in the corresponding parse. Parsing the test grammar with rules now produces the following parse tree:

rules('''
a : b c d
d : e f
''')

[notebook output: both rules parsed, with nothing unconsumed]

And just like that, we’ve fixed our immediate problem of sorting out the proper role of an IDENTIFIER when we spot one.

For the curious, the parse tree for our expression grammar looks like:

rules('''
expression : term increments

increments :
           | increment increments

increment : ADDOP term

term : factor scalings

scalings :
         | scaling scalings

scaling : MULOP factor
''')

[notebook output: the parse tree for the expression grammar’s BNF]

12 Expressions Revisited

The expression evaluator is nice, but there’s no substitute for the power of a full programming language. We won’t go that far right now, but let’s consider what it would take to:

  • Allow assignment to and reference of variables; and,
  • Evaluate sequences of statements.

First, let’s come up with a syntax for assigning the result of an expression to a variable:

assignment: IDENTIFIER EQUALS expression

To allow using previously defined variables in expressions, let’s introduce the concept of a reference and find a logical place in the grammar to add it:

reference: IDENTIFIER

factor: NUMBER | enclosed | reference

Without the idea of multiple statements, executed in sequence, the presence of variables in our language is a bit nonsensical, so let’s fix that:

statements: statement statements
          | statement

statement: assignment | expression

That’s it. Together with the rest of our expression grammar defined as before, we’re ready to think about implementing the parser. Defining the following parsing functions:

## new terminals

@parser 
def EQUALS(s): return literal('=', s)

## new nonterminals

@parser
def statements(s):
    '''
    statements : statement statements
               | statement
    '''
    return alt(seq(statement, statements), statement)(s)

@parser
def statement(s):
    'statement : assignment | expression'
    return alt(assignment, expression)(s)

@parser
def assignment(s):
    'assignment : IDENTIFIER EQUALS expression'
    return seq(IDENTIFIER, EQUALS, expression)(s)

@parser
def reference(s): 
    'reference : IDENTIFIER'
    return IDENTIFIER(s)

And modifying one more:

## modified nonterminals

@parser
def factor(s): 
    'factor : NUMBER | enclosed | reference'
    return alt(NUMBER, enclosed, reference)(s)

We’re ready for a quick test:

statements('''
x = 3 
4+5 - x
''')

[notebook output: a statements tree containing an assignment and an expression]

With our newfound ability to comprehend a simple, Turing-incomplete language, let’s write a simple interpreter for it. As before, we’ll need handlers for the parsing functions. What’s different this time is that we need a place to store defined variables, and it has to be accessible by the handlers for both reference and assignment. We also need to decide the proper course of action for referencing a variable before it’s been assigned a value.

If we decide that referencing unassigned variables should generate an error, the handlers for assignment and reference might look like:

variables = {}

@assignment.on
def h_assignment(sym):
    name, _, val = sym.value
    variables[name.value] = val.result
    sym.result = val.result

@reference.on
def h_reference(sym): 
    name = sym.value[0].value
    if name not in variables: raise NameError(name)
    else: sym.result = variables[name]

Now to handle statements. Taking the value of the last statement in a sequence as being the value of the entire sequence (a pretty conventional approach),

@statement.on
def h_statement(sym): sym.result = sym.value[0].result

@statements.on
def h_statements(sym):
    if len(sym.value) == 1: sym.result = sym.value[0].result
    else: sym.result = sym.value[1].result

The handlers for the remaining symbols can be just as they were.
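As a quick check before wiring up an interpreter loop (this assumes all of the handlers above are installed):

p, rest = statements('x = 3  4 + 5 - x')
p.result   # gives 6: x is bound to 3, then the final statement evaluates to 6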

For convenience, let’s use our evaluator as the basis for a little interpreter. We don’t have to limit statements to one line each, but it’s simpler if we do. Also, we need to check that each line is entirely consumed—if it isn’t, there’s a syntax error for us to report. With these in mind:

variables = {}

while True:
    line = input('stmt> ').strip()
    if not line: break
    try:
        p = val, rest = statement(line)
        print(val.result)
        if rest.strip():
            pos = len(line) - len(rest)
            print('syntax error at pos. %s' % pos)
            print(p)
    except Exception as e:
        print(e.__class__, e.args)
stmt> 4 + 5*6 - 7
27
stmt> 1 - 2 - 3
-4
stmt> x - 4
<class 'NameError'> ('x',)
stmt> x = 4
4
stmt> (30 + x*10) / 7
10.0
stmt> 1 - 2 -
-1
syntax error at pos. 5

parse
=====
statement:
    expression:
        term:
            factor: NUMBER: '1'
            scalings: EPSILON: ''
        increments:
            increment:
                ADDOP: '-'
                term:
                    factor: NUMBER: '2'
                    scalings: EPSILON: ''
            increments: EPSILON: ''

unconsumed
==========
 -

stmt> x/(x-x)
<class 'ZeroDivisionError'> ('division by zero',)
stmt> x = 3*x
12
stmt> x
12
stmt> 

13 Wrapping Up

Our little interpreter leaves a great deal to be desired—even though it performs real arithmetic, it only accepts positive integers in its input; it only implements four basic operations; it has no looping or decision constructs; it lacks useful functions like sin, cos, and log; etc. Even so, because of the path we took in creating it, we can imagine how we might approach modifying the language to include at least some of the features we want, as well as how we might go about bringing those features into being.

When we first started, we couldn’t even properly handle nested parentheses, something required for a functioning regex engine. Now we can engineer context free grammars to produce parse trees that have desirable properties; use syntax-directed translation to perform computations using those trees; and implement simple interpreters with passable error detection and recovery. Along the way, we invented a computational model, the pushdown automaton (PDA); figured out a reasonable method for programming it; and used our understanding of its operation to devise a practical approach for writing parsers for deterministic context free languages by hand, with an array of little utilities to help us make short work of such a task.

There’s a bit more in the land of recursive descent parsing to cover, like backtracking, EBNF, and parser generators, but we’re at a natural stopping point. With that, let’s stop for now.


R E I N D E E R E F F E C T