Monday, October 10, 2016

How (not) to write a parser

One of my favorite subjects during college days was Parsing and compiler construction (remember The dragon book?)Ever since the first time I tried lex/yacc (flex/bison) it was a mysterious piece of software that I wanted to master using! 

Zoom forward 10 years, a lot has changed in the field: LR(1) is passé, GLR and LL(1) are in-vogue (thanks to tools like ANTLR) 

When I thought about venturing into writing a parser for the YANG modelling language (RFC 6020), I first did some study+survey, and I formed some new opinions about parsing: 
- Don't roll out a RDP in desperation  
- No separate tokenization phase (scanner-less) 
- Not LR(1) - too limited 
- Not ANTLR! (though its a fantastic tool - it leans a lot towards Java!) 
- Not Bison GLR! (no proper documentation!) 
- Use PEG (yet another learning opportunity :-)) 
- Not C! (more about it later) 

PEG is cool, and with a love for regex PEG attracted me way too much :-) I tried many PEG based parser generators. 

To start with I picked up Waxeye , though it has a choice of many targets, the C port is not optimal (author himself told, that more work's needed). 

Then I tried peg/leg, though its very similar to lex/yacc in usage, I was finding it hard to debug/modify the grammar. Part of the problem could have also been the YANG ABNF grammar (as defined in the RFC). There are no reserved words in YANG. I tried peg/leg much more than Waxeye, but eventually I gave up once I realized that I was getting nowhere! So, C is gone - as I had only 2 options that had a C target!

Lua/LPeg: Though I'm familiar with Lua, this was my first tryst with LPeg. I had heard a lot of praise for Roberto's work, and I got to know why, quite soon. Within a few hours I was able to create a matcher (deliberately avoiding the word parser here) that could recognize 80% of the YANG files at my disposal! (had to tweak the matcher for the remaining 20%). This time, I chose a different approach, instead of using the YANG ABNF grammar, I created some simple relaxed rules. 

Here is a brief description of the approach taken.

YANG statements can be either of the form:

1.   identifier [value];

Or:

2.   identifier [value] {
            One or more of 1. (depending on identifier)
  }

Note that value is optional. Since there are no reserved words, it leads to a simple matcher (~30lines!). Even though there are no reserved words, we have some keywords, and based on that we should be able to say whether the nesting is right or wrong – I call this as the validation phase. And, that too, can be simplified to triviality, if we keep the AST as a simple Lua array (table).

For e.g: A leaf can be under a container, but not vice-versa:

Valid:
container C1{
  leaf l1 {
     type string;
  }
}

Invalid:
 leaf l1 {
  container C1 {
  }
}

I keep a list of allowed-children for each type of node. Additional checks are performed with custom functions per node-type. And, since these children are arranged as arrays – we can also ensure order.

That’s it! :-) - take a look at: https://github.com/aniruddha-a/lyang

Comments/feedback welcome.