Parsing the command line with MGrammar – part 1

Let’s take a look at how we can use MGrammar to create a mini-DSL for a language most developers know quite well: command line arguments. Most applications that accept arguments on the command line in Windows (or in Linux/Un*x for that matter) take the form:

Application.exe /a /b 123 /c "some input string goes here"

Some applications use / as the “marker” that an argument is following, while others use - or --. It is also quite common to allow both a verbose and an abbreviated version of the same command.
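To make that shape concrete, here is a minimal sketch in Python (not MGrammar, and not the parser built in this series) that splits such a command line into switches and optional values. The marker set and the function name split_arguments are assumptions made up for the illustration; real applications differ in the details.

```python
import shlex

# Assumed marker set for this sketch; "--" is checked before "-".
MARKERS = ("--", "/", "-")

def split_arguments(command_line):
    """Return (application, [(switch, value_or_None), ...])."""
    # shlex.split keeps a double-quoted string together as one word.
    words = shlex.split(command_line)
    application, rest = words[0], words[1:]
    arguments = []
    for word in rest:
        marker = next((m for m in MARKERS if word.startswith(m)), None)
        if marker is not None:
            # A new switch; its value (if any) arrives with the next word.
            arguments.append((word[len(marker):], None))
        elif arguments and arguments[-1][1] is None:
            # A plain word becomes the value of the preceding switch.
            name, _ = arguments[-1]
            arguments[-1] = (name, word)
        else:
            raise ValueError(f"value without a switch: {word!r}")
    return application, arguments
```

Calling split_arguments on the example line above yields the application name plus three switches, where a has no value, b carries 123 and c carries the quoted string.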

Well, that was Command Line 101. Here’s a brief explanation and some code showing how we can parse it with MGrammar + C#.

Here’s a screenshot of Intellipad where the MGrammar for the command line parsing DSL is displayed in the second pane:

[Screenshot: the command line grammar in Intellipad]

If anyone has a Windows Live Writer plugin that does syntax highlighting of M & MGrammar, please send me an email :-)

[Side note: Since the grammar for M & MGrammar is shipped as part of the Samples in the Oslo SDK, it should be quite easy to put together a basic HTML syntax highlighter for both languages by loading the compiled grammar and using the Lexer in System.Dataflow. Note to self: Investigate this further]

If you’re not familiar with MGrammar, I’ll walk through cmd.mg for you. The general idea is that MGrammar helps you transform text (the input) into MGraph, a directed labeled graph that can contain ordered and unordered nodes. The MGraph can then be traversed and acted upon.

The language CommandLineLang resides in a module named LarsW.Languages. The module keyword works pretty much like namespace NNN in C# and is used to divide the world into smaller pieces. Things that live inside a module can be exposed to the outside with the export keyword (not shown in the example), and things from the outside can be welcomed in with the import keyword.

The same way void Main(string[] args) is the default entry point in a C# application, syntax Main = …; is the entry point in an MGrammar-based language.

In general, there are two things we need to work with in an MGrammar: syntax and token statements. Last thing first: tokens describe regular languages (regular expressions), where you define the set of characters that will make a match using Unicode characters, sequences of these and the normal Kleene operators: ? for optional elements, * for zero-to-many and + for one-to-many. Parentheses – () – can be used for grouping and | is used for choosing between two options. If you are familiar with regular expressions, writing tokens should be quite easy. Note that you can, and will, write the tokens in a hierarchical fashion, since your grammar would turn into a complete mess if you had to expand a lot of the regular expressions inline.
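The hierarchical style is easy to mimic with ordinary regular expressions. The sketch below is Python, not MGrammar, and the names LETTER, DIGIT, IDENTIFIER, NUMBER, MARKER, SWITCH and SIGNED_NUMBER are invented for the illustration; they are not the tokens from cmd.mg. It only shows how small patterns are reused to build larger ones with ?, *, +, grouping and |.

```python
import re

# Small building blocks, comparable to low-level tokens.
LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"

# Larger patterns reuse the smaller ones instead of repeating them.
IDENTIFIER = rf"{LETTER}(?:{LETTER}|{DIGIT})*"  # * : zero-to-many
NUMBER = rf"{DIGIT}+"                           # + : one-to-many
MARKER = r"(?:/|-)"                             # | : choice, () : grouping
SWITCH = rf"{MARKER}{IDENTIFIER}"
SIGNED_NUMBER = rf"-?{NUMBER}"                  # ? : optional

def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None
```

With these definitions, matches(SWITCH, "/abc1") is true while matches(SWITCH, "abc") is false, because the marker is mandatory.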

Syntax rules describe context-free languages and can be made up of tokens and other syntax rules. You can also project the matched tokens differently with the => operator. Without this you would have to do a lot more coding in your backend, so you will definitely want to exercise your grammar in Intellipad with some samples until you’re satisfied with the MGraph it outputs. Compare a rule that uses the default projection with the same rule given a custom one:

   syntax Rule = tToken tString tStatementTerminator;
   syntax Rule = tToken string:tString tStatementTerminator
             => Rule { Value { string }};

While not used in this sample, recursive rules are an essential building block for grammars that consume things like a comma-separated list (or repeating elements in general).

A repeating rule can look something like this:

   syntax Items = item:Item => Items {item}
                | items:Items item:Item
                 => Items {items, item};
   syntax Item = ...;

As you probably notice, the Items rule is used inside itself, so this rule is recursive. The “problem” with this type of syntax rule is that it produces nested nodes in the MGraph. This isn’t really a problem, but it makes traversal in the backend more tedious. To mitigate this, the Oslo team came up with the valuesof() construct, which will “flatten” a set of hierarchical nodes for you:

   syntax Items = item:Item => Items {item}
                | items:Items item:Item
                  => Items {valuesof(items), item};
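To picture the difference in shape, here is a small Python model of the two projections. It is only an analogy: the tuples stand in for MGraph nodes, and the function names parse_nested and parse_flat are made up for this sketch, not part of any Oslo API.

```python
def parse_nested(items):
    """Mimic the recursive rule without valuesof: Items {items, item}."""
    node = None
    for item in items:
        if node is None:
            node = ("Items", [item])
        else:
            # The previous Items node becomes a child of the new one.
            node = ("Items", [node, item])
    return node

def parse_flat(items):
    """Mimic the rule with valuesof(items): the inner children are spliced in."""
    node = None
    for item in items:
        if node is None:
            node = ("Items", [item])
        else:
            # Only the children of the previous node are carried over.
            node = ("Items", [*node[1], item])
    return node
```

For three items, parse_nested produces an Items node nested two levels deep, while parse_flat produces a single Items node with three children, which is the shape a backend would rather traverse.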

The interleave keyword basically tells the lexer which tokens it can ignore. This will typically be whitespace and comments.
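The same idea can be sketched outside MGrammar as a lexer that recognises the ignorable tokens but never hands them to the parser. This is a Python illustration under my own assumptions: the // comment syntax and the names TOKEN and lex are invented here and are not taken from cmd.mg.

```python
import re

TOKEN = re.compile(
    r"""
    (?P<skip>\s+|//[^\n]*)   # ignorable: whitespace and line comments
  | (?P<word>\S+)            # everything else is a significant token
    """,
    re.VERBOSE,
)

def lex(text):
    """Return the significant tokens, dropping the ignorable ones."""
    tokens = []
    for match in TOKEN.finditer(text):
        if match.lastgroup == "skip":
            continue  # matched, but never reaches the parser
        tokens.append(match.group())
    return tokens
```

The rules that follow the lexer then never need to mention whitespace or comments, which is what keeps the syntax rules readable.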

So now that we know some of the basics, let’s take a look at cmd.mg again. It basically consists of four syntax rules and four token rules. I’ve applied custom projections to most of the rules so that the MGraph production looks reasonably sane.

In the next installment of this series I will discuss how we can create a backend that will consume the MGraph and take action on the command line parameters.

The source is released under the Apache License 2.0 and can be found here: http://github.com/larsw/larsw.commandlineparser. This is the first project I have released on GitHub, and if the experience is good, I believe I’ll continue to use it.

There’s a Download button that you can use to download either a zip or a tarball of the source tree if you haven’t installed git.
