One of my favorite subjects during college days was Parsing and compiler
construction (remember The dragon book?). Ever since the first time I tried lex/yacc (flex/bison) it was a mysterious piece of software that I wanted to
master using!
Zoom forward 10 years, a lot has changed in the field: LR(1) is passé, GLR and LL(1) are in-vogue (thanks to tools
like ANTLR)
When I thought about venturing into writing a parser for the YANG
modelling language (RFC 6020), I first did some study+survey, and I formed
some new opinions about parsing:
- No separate tokenization phase (scanner-less)
- Not LR(1) - too limited
- Not ANTLR! (though its a fantastic tool - it leans a lot towards Java!)
- Not Bison GLR! (no proper documentation!)
- Not C! (more about it later)
PEG is cool, and with a love for regex PEG attracted me way too much :-)
I tried many PEG based parser generators.
To start with I picked up Waxeye , though it has a choice of many targets, the C port is not optimal (author himself told, that more
work's needed).
Then I tried peg/leg, though its very similar to lex/yacc in usage, I was finding it hard to debug/modify the grammar. Part of the
problem could have also been the YANG ABNF grammar (as defined in the RFC).
There are no reserved words in YANG. I tried peg/leg much more than Waxeye, but eventually I gave up once I realized that I was
getting nowhere! So, C is gone - as I had only 2 options that had a C target!
Lua/LPeg: Though I'm
familiar with Lua, this was my first tryst with LPeg. I had heard a lot of praise for Roberto's work, and
I got to know why, quite soon. Within a few hours I was able to create a matcher
(deliberately avoiding the word parser here) that could recognize 80% of the
YANG files at my disposal! (had to tweak the matcher for the remaining 20%).
This time, I chose a different approach, instead of using the YANG ABNF
grammar, I created some simple relaxed rules.
Here is a brief description of the approach taken.
YANG statements can be either of the form:
1. identifier [value];
Or:
2. identifier [value] {
One or more of 1. (depending on identifier)
}
Note that value is optional. Since there are no reserved words, it leads to a
simple matcher (~30lines!). Even though there are no reserved words, we have some keywords,
and based on that we should be able to say whether the nesting is right or
wrong – I call this as the validation phase. And, that too, can be simplified
to triviality, if we keep the AST as a simple Lua array (table).
For e.g: A leaf can be under a container, but not
vice-versa:
Valid:
container C1{
leaf l1 {
type string;
}
}
Invalid:
leaf l1 {
container C1 {
}
}
I keep a list of allowed-children for each type of
node. Additional checks are performed with custom functions per node-type. And,
since these children are arranged as arrays – we can also ensure order.
That’s it! :-) - take a look at: https://github.com/aniruddha-a/lyang
Comments/feedback welcome.