Grammars

Group of named regexes that form a formal grammar

Grammars are a powerful tool used to destructure text and often to return data structures that have been created by interpreting that text.

For example, Perl 6 is parsed and executed using a Perl 6-style grammar.

An example that's more practical to the common Perl 6 user is the JSON::Tiny module, which can deserialize any valid JSON file, however the deserializing code is written in less than 100 lines of simple, extensible code.

If you didn't like grammar in school, don't let that scare you off grammars. Grammars allow you to group regexes, just as classes allow you to group methods of regular code.

Named Regexes

The main ingredient of grammars is named regexes. While the syntax of Perl 6 Regexes is outside the scope of this document, named regexes have a special syntax, similar to subroutine definitions:[1]

my regex number { \d+ [ \. \d+ ]? }

In this case, we have to specify that the regex is lexically scoped using the my keyword, because named regexes are normally used within grammars.

Being named gives us the advantage of being able to easily reuse the regex elsewhere:

say "32.51" ~~ &number;
say "15 + 4.5" ~~ / <number> \s* '+' \s* <number> /

regex isn't the only declarator for named regexes -- in fact, it's the least common. Most of the time, the token or rule declarators are used. These are both ratcheting, which means that the match engine won't back up and try again if it fails to match something. This will usually do what you want, but isn't appropriate for all cases:

my regex works-but-slow { .+ q }
my token fails-but-fast { .+ q }
my $s = 'Tokens won\'t backtrack, which makes them fail quicker!';
say so $s ~~ &works-but-slow; # True
say so $s ~~ &fails-but-fast; # False, the entire string get taken by the .+

The only difference between the token and rule declarators is that the rule declarator causes :sigspace to go into effect for the Regex:

my token non-space-y { 'once' 'upon' 'a' 'time' }
my rule space-y { 'once' 'upon' 'a' 'time' }
say 'onceuponatime'    ~~ &non-space-y;
say 'once upon a time' ~~ &space-y;

Creating Grammars

Grammar is the superclass that classes automatically get when they are declared with the grammar keyword instead of class. Grammars should only be used to parse text; if you wish to extract complex data, an action object is recommended to be used in conjunction with the grammar.

Protoregexes

If you have a lot of alternations, it may become difficult to produce readable code or subclass your grammar. In the Actions class below, the ternary in method TOP is less than ideal and it becomes even worse the more operations we we add:

grammar Calculator {
    token TOP { [ <add> | <sub> ] }
    rule  add { <num> '+' <num> }
    rule  sub { <num> '-' <num> }
    token num { \d+ }
}

class Calculations {
    method TOP ($/) { make $<add> ?? $<add>.made !! $<sub>.made; }
    method add ($/) { make [+] $<num>; }
    method sub ($/) { make [-] $<num>; }
}

say Calculator.parse('2 + 3', actions => Calculations).made;

# OUTPUT:
# 5

To make things better, we can use protoregexes that look like `:sym<...> adverbs on tokens:

grammar Calculator {
    token TOP { <calc-op> }

    proto rule calc-op          {*}
          rule calc-op:sym<add> { <num> '+' <num> }
          rule calc-op:sym<sub> { <num> '-' <num> }

    token num { \d+ }
}

class Calculations {
    method TOP              ($/) { make $<calc-op>.made; }
    method calc-op:sym<add> ($/) { make [+] $<num>; }
    method calc-op:sym<sub> ($/) { make [-] $<num>; }
}

say Calculator.parse('2 + 3', actions => Calculations).made;

# OUTPUT:
# 5

In the grammar, the alternation has now been replaced with <calc-op> , which is essentially the name of a group of values we'll create. We do so by defining a rule prototype with proto rule calc-op. Each of our previous alternations have been replaced by a new rule calc-op definition and the name of the alternation is attached with :sym<> adverb.

In the actions class, we now got rid of the ternary operator and simply take the .made value from the $<calc-op> match object. And the actions for individual alternations now follow the name naming pattern as in the grammar: method calc-op:sym<add> and method calc-op:sym<sub> >.

The real beauty of this method can be seen when you subclass that grammar and actions class. Let's say we want to add a multiplication feature to the calculator:

grammar BetterCalculator is Calculator {
    rule calc-op:sym<mult> { <num> '*' <num> }
}

class BetterCalculations is Calculations {
    method calc-op:sym<mult> ($/) { make [*] $<num> }
}

say BetterCalculator.parse('2 * 3', actions => BetterCalculations).made;

# OUTPUT:
# 6

All we had to add are additional rule and action to the calc-op group and the thing works—all thanks to protoregexes.

Special Tokens

TOP

grammar Foo {
    token TOP { \d+ }
}

The TOP token is the default first token attempted to match when parsing with a grammar—the root of the tree. Note that if you're parsing with .parse method, token TOP is automatically anchored to the start and end of the string (see also: .subparse).

Using rule TOP or regex TOP are also acceptable.

A different token can be chosen to be matched first using the :rule named argument to .parse, .subparse, or .parsefile Grammar methods.

ws

When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws. That is:

rule entry { <key> ’=’ <value> }

Is the same as:

token entry { <key> <.ws> ’=’ <.ws> <value> <.ws> } # . = non-capturing

The default ws matches "whitespace", such a sequence of spaces (of whatever type), newlines, unspaces, or heredocs.

It's perfectly fine to provide your own ws token:

grammar Foo {
    rule TOP { \d \d }
}.parse: "4   \n\n 5"; # Succeeds

grammar Bar {
    rule TOP { \d \d }
    token ws { \h*   }
}.parse: "4   \n\n 5"; # Fails

Always Succeed Assertion

The <?> is the always succeed assertion. When used as a grammar token, it can be used to trigger an Action class method. In the following grammar we look for Arabic digits and define a succ token with the always succeed assertion.

In the action class, we use calles to the succ method to do set up (in this case, we prepare a new element in @!numbers). In the digit method, we convert an Arabic digit into a Devanagari digit and add it to the last element of @!numbers. Thanks to succ, the last element will always be the number for the currently parsed digit digits.

grammar Digifier {
    rule TOP {
        [ <.succ> <digit>+ ]+
    }
    token succ   { <?> }
    token digit { <[0..9]> }
}

class Devanagari {
    has @!numbers;
    method digit ($/) { @!numbers[*-1] ~= $/.ord.&[+](2358).chr }
    method succ  ($)  { @!numbers.push: ''     }
    method TOP   ($/) { make @!numbers[^(*-1)] }
}

say Digifier.parse('255 435 777', actions => Devanagari.new).made;
# OUTPUT:
# (२५५ ४३५ ७७७)

Methods in Grammar

It's fine to use methods instead of rules or tokens in a grammar, as long as they return a /type/Cursor:

grammar DigitMatcher {
    method TOP (:$full-unicode) {
        $full-unicode ?? self.num-full !! self.num-basic;
    }
    token num-full  { \d+ }
    token num-basic { <[0..9]>+ }
}

The grammar above will attempt different matches depending on the arguments provided by parse methods:

say +DigitMatcher.subparse: '12७१७९०९', args => \(:full-unicode);
# OUTPUT:
# 12717909

say +DigitMatcher.subparse: '12७१७९०९', args => \(:!full-unicode);
# OUTPUT:
# 12

Action Objects

A successful grammar match gives you a parse tree of Match objects, and the deeper that match tree gets, and the more branches in the grammar are, the harder it becomes to navigate the match tree to get the information you are actually interested in.

To avoid the need for diving deep into a match tree, you can supply an actions object. After each successful parse of a named rule in your grammar, it tries to call a method of the same name as the grammar rule, giving it the newly created Match object as a positional argument. If no such method exists, it is skipped.

Here is a contrived example of a grammar and actions in action:

use v6;

grammar TestGrammar {
    token TOP { \d+ }
}

class TestActions {
    method TOP($/) {
        $/.make(2 + $/);
    }
}

my $actions = TestActions.new;
my $match = TestGrammar.parse('40', :$actions);
say $match;         # 「40」
say $match.made;    # 42

An instance of TestActions is passed as named argument actions to the parse call, and when token TOP has matched successfully, it automatically calls method TOP, passing the match object as an argument.

To make it clear that the argument is a match object, the example uses $/ as a parameter name to the action method, though that's just a handy convention, nothing intrinsic. $match would have worked too. (Though using $/ does give the advantage of providing $<capture> as a shortcut for $/<capture> ).

A slightly more involved example follows:

use v6;

grammar KeyValuePairs {
    token TOP {
        [<pair> \n+]*
    }
    token ws { \h* }

    rule pair {
        <key=.identifier> '=' <value=.identifier>
    }
    token identifier {
        \w+
    }
}

class KeyValuePairsActions {
    method identifier($/) { $/.make: ~$/                          }
    method pair      ($/) { $/.make: $<key>.made => $<value>.made }
    method TOP       ($/) { $/.make: $<pair>».made                }
}

my  $res = KeyValuePairs.parse(q:to/EOI/, :actions(KeyValuePairsActions)).made;
    second=b
    hits=42
    perl=6
    EOI

for @$res -> $p {
    say "Key: $p.key()\tValue: $p.value()";
}

This produces the following output:

Key: second     Value: b
Key: hits       Value: 42
Key: perl       Value: 6

Rule pair, which parsed a pair separated by an equals sign, aliases the two calls to token identifier to separate capture names to make them available more easily and intuitively. The corresponding action method constructs a Pair object, and uses the .made property of the sub match objects. So it (like the action method TOP too) exploits the fact that action methods for submatches are called before those of the calling/outer regex. So action methods are called in post-order.

The action method TOP simply collects all the objects that were .made by the multiple matches of the pair rule, and returns them in a list.

Also note that KeyValuePairsActions was passed as a type object to method parse, which was possible because none of the action methods use attributes (which would only be available in an instance).

In other cases, action methods might want to keep state in attributes. Then of course you must pass an instance to method parse.

Note that token ws is special: when :sigspace is enabled (and it is when we are using rule), it replaces certain whitespace sequences. This is why the spaces around the equals sign in rule pair work just fine and why the whitespace before closing } does not gobble up the newlines looked for in token TOP.