Grammar Tutorial

The Broad Concept of Perl Grammars

Grammars parse strings and return data structures from those strings. Grammars can be used to prepare a program for execution, to determine if a program can run at all (if it's a valid program), to break down a web page into constituent parts, or to identify the different parts of a sentence, among other things.

If you have strings to tame or interpret, grammars provide the tools to do the job.

The string could be a file that you're looking to break into sections; perhaps a protocol, like SMTP, where you need to specify which "commands" come after what user-supplied data; maybe you're designing your own domain specific language. Grammars can help.

Regular expressions work well for finding patterns in strings. However, for some tasks -- like finding multiple patterns at once, or combining patterns, or testing for patterns that may surround strings -- regular expressions, alone, are not enough.

When working with HTML, you could define a grammar to recognize HTML tags, both the opening and closing elements, and the text in between. You could then organize these elements into data structures, such as arrays or hashes.

Getting More Technical

The conceptual overview

Grammars are a special kind of class. You declare and define a grammar exactly as you would any other class, except that you use the grammar keyword instead of class.

grammar My::Gram { ... }

Grammars are made up of methods that define a regex, a token, or a rule. These are all varieties of different types of match methods. Once you have a grammar defined, you call it and pass in a string for parsing.

my $matchObject = My::Gram.parse($string);

Now, you may be wondering, if I have all these regexes defined that just return their results, how does that help with parsing strings that may be ahead or backwards in a string, or things that need to be combined from many of those regexes... and that's where grammar actions come in.

For every "method" you match in your grammar, you get an action you can use to act on that match. You also get an over-arching action that you can use to tie together all your matches and to build a data structure. This over-arching method is called TOP by default.

The technical overview

As already mentioned, grammars are declared using the grammar keyword and its "methods" are declared with regex, or token, or rule.

When a method (regex, token or rule) matches in the grammar, that matched string is put into a match object and keyed with the same name as the method.

grammar My::Gram {
    token TOP { <thingy> .* }
    token thingy { 'clever_text_keyword' }
}

If you were to use my $match = My::Gram.parse( $string ) and your string started with 'clever_text_keyword', you would get a match object back that contained 'clever_text_keyword' keyed by the name of 'thingy' in your match object.

The TOP method (whether regex, token, or rule) is the overarching pattern that must match everything (by default). If the parsed string doesn't match the TOP regex, your returned match object will be empty (Any).

As you can see above, in TOP, the "<thingy>" token is mentioned. The <thingy> is defined on the next line, "token thingy...". That means that 'clever_text_keyword' must be the first thing in the string or the grammar parse will fail and we'll get an empty match. This is great for recognizing a malformed string that should be discarded.

Learning By Example - a REST Contrivance

Let's suppose we'd like to parse a URI into the component parts that make up a RESTful request. We want the URIs to work like this:

So, if we have "/product/update/7/notify", we would want our grammar to give us a $match object that has a "subject" of "product", a "command" of "update", and "data" of "7/notify" (for now, anyway).

We'll start by defining a grammar class and some match methods for the subject, command, and data. We'll use the token declarator since we don't care about whitespace.

grammar REST {
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
}

So far, this REST grammar says we want a subject that will be just word characters, a command that will be just word characters, and data that will be everything else left in the string.

Next, we'll want to arrange these matching tokens within the larger context of the URI. That's what the TOP method allows us to do. We'll add the TOP method and place the names of our tokens within it, together with the rest of the patterns that makes up the overall pattern. Note how we're building a larger regex from our named regexes.

grammar REST {
    token TOP     { '/' <subject> '/' <command> '/' <data> }
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
}

With this code, we can already get the three parts of our RESTful request:

my $match = REST.parse('/product/update/7/notify');
say $match;
 
# OUTPUT: «「/product/update/7/notify」␤ 
#          subject => 「product」 
#          command => 「update」 
#          data => 「7/notify」» 

The data can be accessed directly by using $match<subject> or $match<command> or $match<data> to return the values parsed. They each contain match objects that you can work further with, such as coercing into a string ( $match<command>.Str ).

Adding some flexibility

So far, the grammar will handle retrieves, deletes and updates. However, a create command doesn't have the third part (the data portion). This means the grammar will fail to match if we try to parse a create URI. To avoid this, we need to make that last data position match optional, along with the '/' preceding it. This is accomplished by adding a question mark to the grouped '/' and data components of the TOP token, to indicate their optional nature, just like a normal regex.

So, now we have:

grammar REST {
    token TOP     { '/' <subject> '/' <command> [ '/' <data> ]? }
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
}
 
my $m = REST.parse('/product/create');
say $m<subject>$m<command>;
 
# OUTPUT: «「product」「create」␤» 

Next, assume that the URIs will be entered manually by a user and that the user might accidentally put spaces between the '/'s. If we wanted to accommodate for this, we could replace the '/'s in TOP with a token that allowed for spaces.

grammar REST {
    token TOP     { <slash><subject><slash><command>[<slash><data>]? }
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
 
    token slash   { \s* '/' \s* }
}
 
my $m = REST.parse('/ product / update /7 /notify');
say $m;
 
# OUTPUT: «「/ product / update /7 /notify」␤ 
#          slash => 「/ 」 
#          subject => 「product」 
#          slash => 「 / 」 
#          command => 「update」 
#          slash => 「 /」 
#          data => 「7 /notify」» 

We're getting some extra junk in the match object now, with those slashes. There's techniques to clean that up that we'll get to later.

Adding some constraints

We want our RESTful grammar to allow for CRUD operations only. Anything else we want to fail to parse. That means our "command" above should have one of four values: create, retrieve, update or delete.

There are several ways to accomplish this. For example, you could change the command method:

token command { \w+ }
 
# …becomes… 
 
token command { 'create'|'retrieve'|'update'|'delete' }

For a URI to parse successfully, the second part of the string between '/'s must be one of those CRUD values, otherwise the parsing fails. Exactly what we want.

There's another technique that provides greater flexibility and improved readability when options grow large: proto-regexes.

To utilize these proto-regexes (multimethods, in fact) to limit ourselves to the valid CRUD options, we'll replace "token command" with the following:

proto token command {*}
token command:sym<create>   { <sym> }
token command:sym<retrieve> { <sym> }
token command:sym<update>   { <sym> }
token command:sym<delete>   { <sym> }

The 'sym' keyword is used to create the various proto-regex options. Each option is named (e.g., sym<update>), and for that option's use, a special <sym> token is auto-generated with the same name.

The <sym> token, as well as other user-defined tokens, may be used in the proto- regex option block to define the specific 'match condition'. Regex tokens are compiled forms and, once defined, cannot subsequently be modified by adverb actions (e.g., :i). Therefore, as it's auto-generated, the special <sym> token is useful only where an exact match of the option name is required.

If, for one of the proto-regex options, a match condition occurs, then the whole proto's search terminates. The matching data, in the form of a match object, is assigned to the parent proto token. If the special <sym> token was employed and formed all or part of the actual match, then it's preserved as a sub-level in the match object, otherwise it's absent.

Using proto-regexes like this gives us a lot of flexibility. For example, instead of returning <sym>, which in this case is the entire string that was matched, we could instead enter our own string, or do other funny stuff. We could do the same with the "token subject" method and limit it also to only parsing correctly on valid subjects (like 'part' or 'people', etc.).

Putting our RESTful grammar together

This is what we have for processing our RESTful URIs, so far:

grammar REST
{
    token TOP { <slash><subject><slash><command>[<slash><data>]? }
 
    proto token command {*}
    token command:sym<create>   { <sym> }
    token command:sym<retrieve> { <sym> }
    token command:sym<update>   { <sym> }
    token command:sym<delete>   { <sym> }
 
    token subject { \w+ }
    token data    { .* }
    token slash   { \s* '/' \s* }
}

Let's look at various URIs and see how they work with our grammar.

my @uris = ['/product/update/7/notify',
            '/product/create',
            '/item/delete/4'];
 
for @uris -> $uri {
    my $m = REST.parse($uri);
    say "Sub: $m<subject> Cmd: $m<command> Dat: $m<data>";
}
 
# OUTPUT: «Sub: product Cmd: update Dat: 7/notify␤ 
#          Sub: product Cmd: create Dat: 
#          Sub: item Cmd: delete Dat: 4» 

With just this part of a grammar, we're getting almost everything we're looking for. The URIs get parsed and we get a data structure with the data.

The data token returns the entire end of the URI as one string. The 4 is fine. However from the '7/notify', we only want the 7. To get just the 7, we'll use another feature of grammar classes: actions.

Grammar Actions

Grammar actions are used within grammar classes to do things with matches. Actions are defined in their own classes, distinct from grammar classes.

You can think of grammar actions as a kind of plug-in expansion module for grammars. A lot of the time you'll be happy using grammars all by their own. But when you need to further process some of those strings, you can plug in the Actions expansion module.

To work with actions, you use a named parameter called "actions" which should contain an instance of your actions class. With the code above, if our actions class called REST-actions, we would parse the URI string like this:

my $matchObject = REST.parse($uriactions => REST-actions.new);
 
#   …or if you prefer… 
 
my $matchObject = REST.parse($uri:actions(REST-actions.new));

If you name your action methods with the same name as your grammar methods (tokens, regexes, rules), then when your grammar methods match, your action method with the same name will get called automatically. The method will also be passed the corresponding match object (represented by the $/ variable).

Let's turn to an example.

Grammars by Example with Actions

Here we are back to our grammar.

grammar REST
{
    token TOP { <slash><subject><slash><command>[<slash><data>]? }
 
    proto token command {*}
    token command:sym<create>   { <sym> }
    token command:sym<retrieve> { <sym> }
    token command:sym<update>   { <sym> }
    token command:sym<delete>   { <sym> }
 
    token subject { \w+ }
    token data    { .* }
    token slash   { \s* '/' \s* }
}

Recall that we want to further process the data token "7/notify", to get the 7. To do this, we'll create an action class that has a method with the same name as the named token. In this case, our token is named "data" so our method is also named "data".

class REST-actions
{
    method data($/{ $/.split('/'}
}

Now when we pass the URI string through the grammar, the "data" token match will be passed to the REST-actions' "data" method. The action method will split the string by the '/' character and the first element of the returned list will be the ID number (7 in the case of "7/notify").

But not really; there's a little more.

Keeping grammars with actions tidy with "make" and "made"

If the grammar calls the action above on data, the data method will be called, but nothing will show up in the big TOP grammar match result returned to our program. In order to make the action results show up, we need to call "make" on that result. The result can be many things, including strings, array or hash structures.

You can imagine that the "make" places the result in a special contained area for a grammar. Everything that we "make" can be accessed later by "made".

So instead of the REST-actions class above, we should write

class REST-actions
{
    method data($/{ make $/.split('/'}
}

When we add "make" to the match split (which returns a list), the action will return a data structure to the grammar that will be stored separately from the "data" token of the original grammar. This way, we can work with both if we need to.

If we want to access just the ID of 7 from that long URI, we access the first element of the list returned from the "data" action that we "made" with "make":

my $uri = '/product/update/7/notify';
 
my $match = REST.parse($uriactions => REST-actions.new);
 
say $match<data>.made[0];  # OUTPUT: «7␤» 
say $match<command>.Str;   # OUTPUT: «update␤» 

Here we call "made" on data, because we want the result of the action that we "made" (with "make") to get the split array. That's lovely! But, wouldn't it be lovelier if we could "make" a friendlier data structure that contained all of the stuff we want, rather than having to coerce types and remember arrays?

Just like TOP, which over-arches and matches the entire string, actions have a TOP method as well. We can "make" all of the individual match components, like "data" or "subject" or "command", and then we can place them in a data structure that we will "make" in TOP. When we return the final match object, we can then access this data structure.

To do this, we add the method TOP to the action class and "make" whatever data structure we like from the component pieces.

So, our action class becomes:

class REST-actions
{
    method TOP ($/{
        make { subject => $<subject>.Str,
               command => $<command>.Str,
               data    => $<data>.made }
    }
 
    method data($/{ make $/.split('/'}
}

Here in the TOP method, the "subject" remains the same as the subject we matched in the grammar. Also, "command" returns the valid <sym> that was matched (create, update, retrieve, or delete). We coerce each into .Str, as well, since we don't need the full match object.

We want to make sure to use the "made" method on the $<data> object, since we want to access the split one that we "made" with "make" in our action, rather than the proper $<data> object.

After we "make" something in the TOP method of a grammar action, we can then access all the custom values by calling the "made" method on the grammar result object. The code now becomes

my $uri = '/product/update/7/notify';
 
my $match = REST.parse($uriactions => REST-actions.new);
 
my $rest = $match.made;
say $rest<data>[0];   # OUTPUT: «7␤» 
say $rest<command>;   # OUTPUT: «update␤» 
say $rest<subject>;   # OUTPUT: «product␤» 

If the complete return match object is not needed, you could return only the made data from your action's TOP.

my $uri = '/product/update/7/notify';
 
my $rest = REST.parse($uriactions => REST-actions.new).made;
 
say $rest<data>[0];   # OUTPUT: «7␤» 
say $rest<command>;   # OUTPUT: «update␤» 
say $rest<subject>;   # OUTPUT: «product␤» 

Oh, did we forget to get rid of that ugly array element number? Hmm. Let's make something new in the grammar's custom return in TOP... how about we call it "subject-id" and have it set to element 0 of <data>.

class REST-actions
{
    method TOP ($/{
        make { subject    => $<subject>.Str,
               command    => $<command>.Str,
               data       => $<data>.made,
               subject-id => $<data>.made[0}
    }
 
    method data($/{ make $/.split('/'}
}

Now we can do this instead

my $uri = '/product/update/7/notify';
 
my $rest = REST.parse($uriactions => REST-actions.new).made;
 
say $rest<command>;    # OUTPUT: «update␤» 
say $rest<subject>;    # OUTPUT: «product␤» 
say $rest<subject-id># OUTPUT: «7␤» 

Here's the final code:

grammar REST
{
    token TOP { <slash><subject><slash><command>[<slash><data>]? }
 
    proto token command {*}
    token command:sym<create>   { <sym> }
    token command:sym<retrieve> { <sym> }
    token command:sym<update>   { <sym> }
    token command:sym<delete>   { <sym> }
 
    token subject { \w+ }
    token data    { .* }
    token slash   { \s* '/' \s* }
}
 
 
class REST-actions
{
    method TOP ($/{
        make { subject    => $<subject>.Str,
               command    => $<command>.Str,
               data       => $<data>.made,
               subject-id => $<data>.made[0}
    }
 
    method data($/{ make $/.split('/'}
}

Hopefully this has helped introduce you to the grammars in Perl 6 and shown you how grammars and grammar action classes work together. For more information, check out the more advanced Perl Grammar Guide.

For more grammar debugging, see Grammar::Debugger. This provides breakpoints and color-coded MATCH and FAIL output for each of your grammar tokens.