Regexes

Pattern matching against strings

A regular expression, or regex for short, is a sequence of characters that describes a pattern of text. Pattern matching is the process of matching those patterns against actual text.

Lexical conventions

Perl 6 has special syntax for writing regexes:

m/abc/;         # a regex that is immediately matched against $_ 
rx/abc/;        # a Regex object 
/abc/;          # a Regex object 

For the first two examples, delimiters other than the slash can be used:

m{abc};
rx{abc};

Note that neither the colon : nor round parentheses can be delimiters; the colon is forbidden because it clashes with adverbs, such as rx:i/abc/ (case insensitive regexes), and round parentheses indicate a function call instead.

Whitespace in regexes is generally ignored (except with the :s adverb, or :sigspace in its long form).
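
A minimal sketch of the difference, assuming the short :s form of the adverb (the variable names are only illustrative):

```perl6
# Without :s, the blank between a and b in the regex is ignored.
my $plain  = so 'ab'  ~~ m/ a b /;
# With :s, that blank is significant and has to match whitespace.
my $strict = so 'ab'  ~~ m:s/ a b /;
my $spaced = so 'a b' ~~ m:s/ a b /;

say $plain;     # OUTPUT: «True␤»
say $strict;    # OUTPUT: «False␤»
say $spaced;    # OUTPUT: «True␤»
```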

Comments work within a regular expression:

/ word #`(match lexical "word") / 

Literals

The simplest case for a regex is a match against a string literal:

if 'properly' ~~ m/ perl / {
    say "'properly' contains 'perl'";
}

Alphanumeric characters and the underscore _ are matched literally. All other characters must either be escaped with a backslash (for example, \: to match a colon), or be within quotes:

/ 'two words' /;     # matches 'two words' including the blank 
/ "a:b"       /;     # matches 'a:b' including the colon 
/ '#' /;             # matches a hash character 

Strings are searched left to right, so it's enough if only part of the string matches the regex:

if 'abcdef' ~~ / de / {
    say ~$/;            # OUTPUT: «de␤» 
    say $/.prematch;    # OUTPUT: «abc␤» 
    say $/.postmatch;   # OUTPUT: «f␤» 
    say $/.from;        # OUTPUT: «3␤» 
    say $/.to;          # OUTPUT: «5␤» 
};

Match results are stored in the $/ variable and are also returned from the match. The result is of type Match if the match was successful; otherwise it's Nil.

Wildcards and character classes

The dot matches any character: .

An unescaped dot . in a regex matches any single character.

So, these all match:

'perl' ~~ /per./;       # matches the whole string 
'perl' ~~ / per . /;    # the same; whitespace is ignored 
'perl' ~~ / pe./;     # the . matches the r 
'speller' ~~ / pe.l/;   # the . matches the first l 

This doesn't match:

'perl' ~~ /. per /;

because there's no character to match before per in the target string.

Note that . now matches any single character, that is, it also matches \n. So the text below matches:

my $text = qq:to/END/ 
  Although I am a
  multi-line text,
  now can be matched
  with /.*/.
  END
  ;
 
say $text ~~ / .* /;
# OUTPUT: «「Although I am a␤multi-line text,␤now can be matched␤with /.*/.␤」␤»

Backslashed, predefined character classes

There are predefined character classes of the form \w. Each is negated by writing its letter in upper case, as in \W.

\d matches a single decimal digit (Unicode General Category Nd, Decimal_Number) and \D matches a single character that is not a digit.

'ab42' ~~ /\d/ and say ~$/;     # OUTPUT: «4␤» 
'ab42' ~~ /\D/ and say ~$/;     # OUTPUT: «a␤» 

Note that not only the Arabic digits (commonly used in the Latin alphabet) match \d, but also digits from other scripts.

Examples for digits are:

U+0035 5 DIGIT FIVE
U+0BEB ௫ TAMIL DIGIT FIVE
U+0E53 ๓ THAI DIGIT THREE
U+17E5 ៥ KHMER DIGIT FIVE

\h matches a single horizontal whitespace character. \H matches a single character that is not a horizontal whitespace character.

Examples for horizontal whitespace characters are

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+0009 CHARACTER TABULATION
U+2001 EM QUAD

Vertical whitespace like newline characters are explicitly excluded; those can be matched with \v, and \s matches any kind of whitespace.

\n matches a single, logical newline character. \n is supposed to also match a Windows CR LF codepoint pair, though it's unclear whether the magic happens at the time that external data is read or at regex match time. \N matches a single character that's not a logical newline.

\s matches a single whitespace character. \S matches a single character that is not whitespace.

if 'contains a word starting with "w"' ~~ / w \S+ / {
    say ~$/;        # OUTPUT: «word␤» 
}

\t matches a single tab/tabulation character, U+0009. (Note that exotic tabs like the U+000B VERTICAL TABULATION character are not included here). \T matches a single character that is not a tab.

\v matches a single vertical whitespace character. \V matches a single character that is not vertical whitespace.

Examples for vertical whitespace characters:

U+000A LINE FEED
U+000B VERTICAL TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0085 NEXT LINE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR

Use \s to match any kind of whitespace, not just vertical whitespace.

\w matches a single word character; i.e., a letter (Unicode category L), a digit or an underscore. \W matches a single character that isn't a word character.

Examples of word characters:

U+0041 A LATIN CAPITAL LETTER A
U+0031 1 DIGIT ONE
U+03B4 δ GREEK SMALL LETTER DELTA
U+03F3 ϳ GREEK LETTER YOT
U+0409 Љ CYRILLIC CAPITAL LETTER LJE
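
As a small sketch, the following checks a handful of sample characters against \w; the sample list and variable names are chosen here purely for illustration:

```perl6
# Letters from any script, digits and the underscore count as word characters;
# the hyphen and the blank do not.
my @samples = 'A', '1', 'δ', '_', '-', ' ';
my @is-word = @samples.map({ so $_ ~~ / \w / });

say @is-word;   # OUTPUT: «[True True True True False False]␤»
```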

Predefined subrules:

<alpha>   <:L>     Alphabetic characters
<digit>   \d       Decimal digits
<xdigit>           Hexadecimal digit [0-9A-Fa-f]
<alnum>   \w       'alpha' plus 'digit'
<punct>            Punctuation and Symbols (only Punct beyond ASCII)
<graph>            'alnum' plus 'punct'
<space>   \s       Whitespace
<cntrl>            Control characters
<print>            'graph' plus 'space', but no 'cntrl'
<blank>   \h       Horizontal whitespace
<lower>   <:Ll>    Lowercase characters
<upper>   <:Lu>    Uppercase characters
<?same>            Matches between two identical characters
<?wb>              Word Boundary (zero-width assertion? suppress capture)
<?ww>              Within Word (zero-width assertion? suppress capture)

Unicode properties

The character classes mentioned so far are mostly for convenience; another approach is to use Unicode character properties. These come in the form <:property>, where property can be a short or long Unicode General Category name. These use pair syntax.

To match against a Unicode Property:

"a".uniprop('Script');                 # OUTPUT: «Latin␤» 
"a" ~~ / <:Script<Latin>> /;           # OUTPUT: «「a」␤» 
"a".uniprop('Block');                  # OUTPUT: «Basic Latin␤» 
"a" ~~ / <:Block('Basic Latin')> /;    # OUTPUT: «「a」␤» 

Unicode General Categories:

Short Long
L Letter
LC Cased_Letter
Lu Uppercase_Letter
Ll Lowercase_Letter
Lt Titlecase_Letter
Lm Modifier_Letter
Lo Other_Letter
M Mark
Mn Nonspacing_Mark
Mc Spacing_Mark
Me Enclosing_Mark
N Number
Nd Decimal_Number (also Digit)
Nl Letter_Number
No Other_Number
P Punctuation (also punct)
Pc Connector_Punctuation
Pd Dash_Punctuation
Ps Open_Punctuation
Pe Close_Punctuation
Pi Initial_Punctuation (may behave like Ps or Pe depending on usage)
Pf Final_Punctuation (may behave like Ps or Pe depending on usage)
Po Other_Punctuation
S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
Z Separator
Zs Space_Separator
Zl Line_Separator
Zp Paragraph_Separator
C Other
Cc Control (also cntrl)
Cf Format
Cs Surrogate
Co Private_Use
Cn Unassigned

For example, <:Lu> matches a single, upper-case letter.

To negate a property, write <:!property>. So, <:!Lu> matches a single character that isn't an upper-case letter.
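
A short sketch of a property and its negation applied to the same string (the string and variable names are arbitrary):

```perl6
# The first match is the first upper-case letter,
# the second is the first character that is not one.
my $upper     = ~('Perl' ~~ / <:Lu>  /);
my $not-upper = ~('Perl' ~~ / <:!Lu> /);

say $upper;       # OUTPUT: «P␤»
say $not-upper;   # OUTPUT: «e␤»
```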

Categories can be used together, with an infix operator:

Operator Meaning
+ set union
| set union
& set intersection
- set difference (first minus second)
^ symmetric set difference / XOR

To match either a lower-case letter or a number, write <:Ll+:N> or <:Ll+:Number> or <+ :Lowercase_Letter + :Number>.

It's also possible to group categories and sets of categories with parentheses; for example:

'perl6' ~~ m{\w+(<:Ll+:N>)}  # OUTPUT: «0 => 「6」␤» 

Enumerated character classes and ranges

Sometimes the pre-existing wildcards and character classes are not enough. Fortunately, defining your own is fairly simple. Within <[ ]>, you can put any number of single characters and ranges of characters (expressed with two dots between the end points), with or without whitespace.

"abacabadabacaba" ~~ / <[ a .. c 1 2 3 ]> /;
# Unicode hex codepoint range 
"ÀÁÂÃÄÅÆ" ~~ / <[ \x[00C0] .. \x[00C6] ]> /;
# Unicode named codepoint range 
"αβγ" ~~ /<[\c[GREEK SMALL LETTER ALPHA]..\c[GREEK SMALL LETTER GAMMA]]>/;

Within the < > you can use + and - to add or remove multiple range definitions and even mix in some of the Unicode categories above. You can also write the backslashed forms for character classes between the [ ].

/ <[\d] - [13579]> /;
# starts with \d and removes odd ASCII digits, but not quite the same as 
/ <[02468]> /;
# because the first one also contains "weird" unicodey digits 

To negate a character class, put a - after the opening angle:

say 'no quotes' ~~ /  <-[ " ]> + /;  # matches characters except " 

A common pattern for parsing quote-delimited strings involves negated character classes:

say '"in quotes"' ~~ / '"' <-[ " ]> * '"'/;

This first matches a quote, then any characters that aren't quotes, and then a quote again. The meanings of * and + in the examples above are explained in the Quantifiers section.

Just as you can use the - for both set difference and negation of a single value, you can also explicitly put a + in front:

/ <+[123]> /  # same as <[123]> 

Quantifiers

A quantifier makes the preceding atom match a variable number of times. For example, a+ matches one or more a characters.

Quantifiers bind tighter than concatenation, so ab+ matches one a followed by one or more bs. This is different for quotes, so 'ab'+ matches the strings ab, abab, ababab etc.
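
The following sketch contrasts the two cases; the input strings are made up for illustration:

```perl6
# ab+ quantifies only the b ...
my $single = ~('abbb'   ~~ / ab+ /);
# ... so on 'ababab' it stops after the first 'ab'.
my $short  = ~('ababab' ~~ / ab+ /);
# Quoting makes 'ab' a single atom, so the whole string is matched.
my $group  = ~('ababab' ~~ / 'ab'+ /);

say $single;   # OUTPUT: «abbb␤»
say $short;    # OUTPUT: «ab␤»
say $group;    # OUTPUT: «ababab␤»
```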

One or more: +

The + quantifier makes the preceding atom match one or more times, with no upper limit.

For example, to match strings of the form key=value, you can write a regex like this:

/ \w+ '=' \w+ /

Zero or more: *

The * quantifier makes the preceding atom match zero or more times, with no upper limit.

For example, to allow optional whitespace between a and b you can write

/ a \s* b /

Zero or one: ?

The ? quantifier makes the preceding atom match zero or one times.

For example, to match dog or dogs, you can write:

/ dogs? /

General quantifier: ** min..max

To quantify an atom an arbitrary number of times, use the ** quantifier. It takes a single Int or a Range on the right-hand side that specifies the number of times to match. If a Range is specified, its end-points specify the minimum and maximum number of times to match.

say 'abcdefg' ~~ /\w ** 4/;      # OUTPUT: «「abcd」␤» 
say 'a'       ~~ /\w **  2..5/;  # OUTPUT: «Nil␤» 
say 'abc'     ~~ /\w **  2..5/;  # OUTPUT: «「abc」␤» 
say 'abcdefg' ~~ /\w **  2..5/;  # OUTPUT: «「abcde」␤» 
say 'abcdefg' ~~ /\w ** 2^..^5/; # OUTPUT: «「abcd」␤» 
say 'abcdefg' ~~ /\w ** ^3/;     # OUTPUT: «「ab」␤» 
say 'abcdefg' ~~ /\w ** 1..*/;   # OUTPUT: «「abcdefg」␤» 

Only basic literal syntax for the right-hand side of the quantifier is supported, to avoid ambiguities with other regex constructs. If you need a more complex expression (for example, a Range made from variables), enclose it in curly braces:

my $start = 3;
say 'abcdefg' ~~ /\w ** {$start .. $start+2}/; # OUTPUT: «「abcde」␤» 
say 'abcdefg' ~~ /\w ** {π.Int}/;              # OUTPUT: «「abc」␤» 

Negative values are treated like zero:

say 'abcdefg' ~~ /\w ** {-Inf}/;     # OUTPUT: «「」␤» 
say 'abcdefg' ~~ /\w ** {-42}/;      # OUTPUT: «「」␤» 
say 'abcdefg' ~~ /\w ** {-10..-42}/; # OUTPUT: «「」␤» 
say 'abcdefg' ~~ /\w ** {-42..-10}/; # OUTPUT: «「」␤» 

If the resulting value is Inf or NaN, or the resulting Range is empty, non-Numeric, contains NaN end-points, or has Inf as its minimum effective end-point, an X::Syntax::Regex::QuantifierValue exception is thrown:

(try say 'abcdefg' ~~ /\w ** {42..10}/  )
    orelse say ($!.^name, $!.empty-range);
    # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤» 
(try say 'abcdefg' ~~ /\w ** {Inf..Inf}/)
    orelse say ($!.^name, $!.inf);
    # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤» 
(try say 'abcdefg' ~~ /\w ** {NaN..42}/ )
    orelse say ($!.^name, $!.non-numeric-range);
    # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤» 
(try say 'abcdefg' ~~ /\w ** {"a".."c"}/)
    orelse say ($!.^name, $!.non-numeric-range);
    # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤» 
(try say 'abcdefg' ~~ /\w ** {Inf}/)
    orelse say ($!.^name, $!.inf);
    # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤» 
(try say 'abcdefg' ~~ /\w ** {NaN}/)
    orelse say ($!.^name, $!.non-numeric);
    # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤» 

Modified quantifier: %, %%

To more easily match things like comma-separated values, you can tack on a % modifier to any of the above quantifiers to specify a separator that must occur between each of the matches. For example, a+ % ',' will match a or a,a or a,a,a and so on. To also match trailing delimiters ( a, or a,a, ), you can use %% instead of %.
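
As a rough illustration, the sketch below anchors the pattern with ^ and $ so that the whole string has to fit; the variable names are arbitrary:

```perl6
# % allows the separator only between repetitions ...
my $list     = so 'a,a,a' ~~ / ^ a+ %  ',' $ /;
my $trailing = so 'a,a,'  ~~ / ^ a+ %  ',' $ /;
# ... while %% also accepts one trailing separator.
my $allowed  = so 'a,a,'  ~~ / ^ a+ %% ',' $ /;

say $list;       # OUTPUT: «True␤»
say $trailing;   # OUTPUT: «False␤»
say $allowed;    # OUTPUT: «True␤»
```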

The quantifier interacts with % and controls the number of overall repetitions that can match successfully, so a* % ',' also matches the empty string. If you want to match words delimited by commas, you might need to nest an ordinary and a modified quantifier:

say so 'abc,def' ~~ / ^ [\w+] ** 1 % ',' $ /;  # OUTPUT: «False␤» 
say so 'abc,def' ~~ / ^ [\w+] ** 2 % ',' $ /;  # OUTPUT: «True␤» 

Preventing backtracking: :

You can prevent backtracking in regexes by attaching a : modifier to the quantifier:

say so 'abababa' ~~ /.* aba/;    # OUTPUT: «True␤» 
say so 'abababa' ~~ /.*: aba/;   # OUTPUT: «False␤» 

Greedy versus frugal quantifiers: ?

By default, quantifiers request a greedy match:

'abababa' ~~ /.* a/ && say ~$/;   # OUTPUT: «abababa␤» 

You can attach a ? modifier to the quantifier to enable frugal matching:

'abababa' ~~ /.*? a/ && say ~$/;   # OUTPUT: «aba␤» 

You can also enable frugal matching for general quantifiers:

say '/foo/o/bar/' ~~ /\/.**?{1..10}\//;  # OUTPUT: «「/foo/」␤» 
say '/foo/o/bar/' ~~ /\/.**!{1..10}\//;  # OUTPUT: «「/foo/o/bar/」␤» 

Greedy matching can be explicitly requested with the ! modifier.

Alternation: ||

To match one of several possible alternatives, separate them by ||; the first matching alternative wins.

For example, ini files have the following form:

[section]
key = value

Hence, if you parse a single line of an ini file, it can be either a section or a key-value pair and the regex would be (to a first approximation):

/ '[' \w+ ']' || \S+ \s* '=' \s* \S* /

That is, either a word surrounded by square brackets, or a string of non-whitespace characters, followed by zero or more spaces, followed by the equals sign =, followed again by optional whitespace, followed by another string of non-whitespace characters.
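
A minimal sketch that applies this first-approximation regex to a few sample lines; the lines and variable names are invented for illustration:

```perl6
# The regex from above, stored in a variable so it can be reused.
my $ini-line = rx/ '[' \w+ ']' || \S+ \s* '=' \s* \S* /;

my @lines   = '[section]', 'key = value', 'no assignment here';
# A line matches if it holds a section header or a key-value pair.
my @results = @lines.map({ so $_ ~~ $ini-line });

say @results;   # OUTPUT: «[True True False]␤»
```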

Longest Alternation: |

In short, in regex branches separated by |, the longest token match wins, independent of the textual ordering in the regex. However, what | really does is more than that. It does not decide which branch wins after finishing the whole match, but follows the longest-token matching (LTM) strategy.

Briefly, what | does is this:

say "abc" ~~ /ab | a.* /;                 # OUTPUT: «「abc」␤» 
say "abc" ~~ /ab | a {} .* /;             # OUTPUT: «「ab」␤» 
say "if else" ~~ / if | if <.ws> else /;  # OUTPUT: «「if」␤» 
say "if else" ~~ / if | if \s+   else /;  # OUTPUT: «「if else」␤» 

As shown above, a.* is a declarative prefix in its entirety, while in a {} .* the prefix ends at {}, so its declarative prefix is just a. Note that non-declarative atoms terminate the declarative prefix. This is quite important if you want to use | in a rule, which automatically enables :s, because the implicit <.ws> then terminates the declarative prefix.

say "abc" ~~ /a. | ab { print "win" } /;  # OUTPUT: «win「ab」␤» 

When two alternatives match at the same length, the tie is broken by specificity. That is, ab, as an exact match, counts as closer than a., which uses character classes.

say "abc" ~~ /a\w| a. { print "lose" } /; # OUTPUT: «「ab」␤» 

If the tie breaker above doesn't work, then the textually earlier alternative takes precedence.

For more details, see the LTM strategy.

Conjunction: &&

Matches successfully if all &&-delimited segments match the same substring of the target string. The segments are evaluated left to right.

This can be useful for augmenting an existing regex. For example if you have a regex quoted that matches a quoted string, then / <quoted> && <-[x]>* / matches a quoted string that does not contain the character x.

Note that you cannot easily obtain the same behavior with a lookahead, that is, a regex that doesn't consume characters, because a lookahead doesn't stop looking when the quoted string stops matching.

say 'abc' ~~ / <?before a> && . /;    # OUTPUT: «Nil␤» 
say 'abc' ~~ / <?before a> . && . /;  # OUTPUT: «「a」␤» 
say 'abc' ~~ / <?before a> . /;       # OUTPUT: «「a」␤» 
say 'abc' ~~ / <?before a> .. /;      # OUTPUT: «「ab」␤» 

Conjunction: &

Much like && in a regex, it matches successfully if all segments separated by & match the same part of the target string.

& (unlike &&) is considered declarative, and notionally all the segments can be evaluated in parallel, or in any order the compiler chooses.

Anchors

The regex engine tries to find a match inside a string by searching from left to right.

say so 'properly use perl' ~~ / perl/;   # OUTPUT: «True␤» 
#          ^^^^ 

But sometimes this is not what you want. Instead, you may only want to match a whole string, or a whole line, or exactly one or several whole words. Anchors or assertions can help with this.

Assertions need to match successfully in order for the whole regex to match but they do not use up characters while matching.

^, Start of String and $, End of String

The ^ assertion only matches at the start of the string:

say so 'properly' ~~ /  perl/;    # OUTPUT: «True␤» 
say so 'properly' ~~ /^ perl/;    # OUTPUT: «False␤» 
say so 'perly'    ~~ /^ perl/;    # OUTPUT: «True␤» 
say so 'perl'     ~~ /^ perl/;    # OUTPUT: «True␤» 

The $ assertion only matches at the end of the string:

say so 'use perl' ~~ /  perl  /;   # OUTPUT: «True␤» 
say so 'use perl' ~~ /  perl $/;   # OUTPUT: «True␤» 
say so 'perly'    ~~ /  perl $/;   # OUTPUT: «False␤» 

You can combine both assertions:

say so 'use perl' ~~ /^ perl $/;   # OUTPUT: «False␤» 
say so 'perl'     ~~ /^ perl $/;   # OUTPUT: «True␤» 

Keep in mind that ^ matches the start of a string, not the start of a line. Likewise, $ matches the end of a string, not the end of a line.

The following is a multi-line string:

my $str = chomp q:to/EOS/; 
   Keep it secret
   and keep it safe
   EOS
 
# 'safe' is at the end of the string 
say so $str ~~ /safe   $/;   # OUTPUT: «True␤» 
 
# 'secret' is at the end of a line, not the string 
say so $str ~~ /secret $/;   # OUTPUT: «False␤» 
 
# 'Keep' is at the start of the string 
say so $str ~~ /^Keep   /;   # OUTPUT: «True␤» 
 
# 'and' is at the start of a line -- not the string 
say so $str ~~ /^and    /;   # OUTPUT: «False␤» 

^^, Start of Line and $$, End of Line

The ^^ assertion matches at the start of a logical line. That is, either at the start of the string, or after a newline character. However, it does not match at the end of the string, even if it ends with a newline character.

$$ matches only at the end of a logical line, that is, before a newline character, or at the end of the string when the last character is not a newline character.

(To understand the following example, it's important to know that the q:to/EOS/...EOS heredoc syntax removes leading indentation to the same level as the EOS marker, so that the first, second and last lines have no leading space and the third and fourth lines have two leading spaces each.)

my $str = q:to/EOS/; 
    There was a young man of Japan
    Whose limericks never would scan.
      When asked why this was,
      He replied "It's because I always try to fit
    as many syllables into the last line as ever I possibly can."
    EOS
 
# 'There' is at the start of string 
say so $str ~~ /^^ There/;        # OUTPUT: «True␤» 
 
# 'limericks' is not at the start of a line 
say so $str ~~ /^^ limericks/;    # OUTPUT: «False␤» 
 
# 'as' is at start of the last line 
say so $str ~~ /^^ as/;            # OUTPUT: «True␤» 
 
# there are blanks between start of line and the "When" 
say so $str ~~ /^^ When/;         # OUTPUT: «False␤» 
 
# 'Japan' is at end of first line 
say so $str ~~ / Japan $$/;       # OUTPUT: «True␤» 
 
# there's a . between "scan" and the end of line 
say so $str ~~ / scan $$/;        # OUTPUT: «False␤» 
 
# matched at the last line 
say so $str ~~ / '."' $$/;        # OUTPUT: «True␤» 

<|w> and <!|w>, word boundary

To match any word boundary, use <|w>. This is similar to other languages' \b.

To match not a word boundary, use <!|w>. This is similar to other languages' \B.

These are both zero width assertions.

say "two-words" ~~ / "two"<|w>"-"<|w>"words" /;    # OUTPUT: «「two-words」␤» 
say "two-words" ~~ / "two"<!|w>"-"<!|w>"words" /;  # OUTPUT: «Nil␤» 

<< and >>, left and right word boundary

<< matches a left word boundary. It matches positions where there is a non-word character (i.e., \W character) at the left (or the start of the string) and a word character to the right.

>> matches a right word boundary. It matches positions where there is a word character at the left and a non-word character at the right (or the end of the string).

These are both zero width assertions.

my $str = 'The quick brown fox';
say so ' ' ~~ /\W/;               # OUTPUT: «True␤» 
say so $str ~~ /br/;              # OUTPUT: «True␤» 
say so $str ~~ /<< br/;           # OUTPUT: «True␤» 
say so $str ~~ /br >>/;           # OUTPUT: «False␤» 
say so $str ~~ /own/;             # OUTPUT: «True␤» 
say so $str ~~ /<< own/;          # OUTPUT: «False␤» 
say so $str ~~ /own >>/;          # OUTPUT: «True␤» 
say so $str ~~ /<< The/;          # OUTPUT: «True␤» 
say so $str ~~ /fox >>/;          # OUTPUT: «True␤» 

You can also use the variants « and » :

my $str = 'The quick brown fox';
say so $str ~~ /« own/;          # OUTPUT: «False␤» 
say so $str ~~ /own »/;          # OUTPUT: «True␤» 

To see the difference between <|w> and «, »:

say "stuff here!!!".subst(:g, />>/, '|');   # OUTPUT: «stuff| here|!!!␤» 
say "stuff here!!!".subst(:g, /<</, '|');   # OUTPUT: «|stuff |here!!!␤» 
say "stuff here!!!".subst(:g, /<|w>/, '|'); # OUTPUT: «|stuff| |here|!!!␤» 

Lookahead assertions <?before pattern>

To check that a pattern appears before another pattern, use a lookahead assertion via the before assertion. This has the form:

<?before pattern>

Thus, to search for the string foo which is immediately followed by the string bar, use the following regex:

rx{ foo <?before bar> }

For example:

say "foobar" ~~ rx{ foo <?before bar> };  # OUTPUT: «「foo」␤» 

However, if you want to search for a pattern which is not immediately followed by some pattern, then you need to use a negative lookahead assertion, which has the form:

<!before pattern>

Hence, all occurrences of foo which are not followed by bar would be matched by

say "foobar" ~~ rx{ foo <!before bar> }   # OUTPUT: «Nil␤» 

Lookahead assertions can also be used with other patterns, such as character ranges, interpolated variables, subscripts and so on. In such cases it suffices to use a ? (or a ! for the negated form). For instance, the following lines all produce the same result:

say 'abcdefg' ~~ rx{ abc <?before def> };        # OUTPUT: 「abc」 
say 'abcdefg' ~~ rx{ abc <?[ d..f ]> };          # OUTPUT: 「abc」 
my @ending_letters = <d e f>;
say 'abcdefg' ~~ rx{ abc <?@ending_letters> };   # OUTPUT: 「abc」 

Lookbehind assertions <?after pattern>

To check that a pattern appears after another pattern, use a lookbehind assertion via the after assertion. This has the form:

<?after pattern>

Therefore, to search for the string bar immediately preceded by the string foo, use the following regex:

rx{ <?after foo> bar }

For example:

say "foobar" ~~ rx{ <?after foo> bar };   # OUTPUT: «「bar」␤» 

However, if you want to search for a pattern which is not immediately preceded by some pattern, then you need to use a negative lookbehind assertion, which has the form:

<!after pattern>

Hence all occurrences of bar which do not have foo before them would be matched by

say "foobar" ~~ rx{ <!after foo> bar }    # OUTPUT: «Nil␤» 

Grouping and Capturing

In regular (non-regex) Perl 6, you can use parentheses to group things together, usually to override operator precedence:

say 1 + 4 * 2;     # OUTPUT: «9␤», parsed as 1 + (4 * 2) 
say (1 + 4) * 2;   # OUTPUT: «10␤» 

The same grouping facility is available in regexes:

/ a || b c /;      # matches 'a' or 'bc' 
/ ( a || b ) c /;  # matches 'ac' or 'bc' 

The same grouping applies to quantifiers:

/ a b+ /;          # matches an 'a' followed by one or more 'b's 
/ (a b)+ /;        # matches one or more sequences of 'ab' 
/ (a || b)+ /;     # matches a string of 'a's and 'b's, except empty string 

An unquantified capture produces a Match object. When a capture is quantified (except with the ? quantifier) the capture becomes a list of Match objects instead.
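
A short sketch of the difference, using made-up variable names; it only inspects the captures and assumes nothing beyond the behavior described above:

```perl6
# Unquantified capture: $0 is a single Match object.
'abc' ~~ / (\w) /;
my $single = $0;

# Quantified capture: $0 holds one Match per repetition.
'abc' ~~ / (\w)+ /;
my @many = $0.list;

say ~$single;          # OUTPUT: «a␤»
say @many.elems;       # OUTPUT: «3␤»
say @many.join('|');   # OUTPUT: «a|b|c␤»
```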

Capturing

The round parentheses don't just group, they also capture; that is, they make the string matched within the group available as a variable, and also as an element of the resulting Match object:

my $str =  'number 42';
if $str ~~ /'number ' (\d+) / {
    say "The number is $0";         # OUTPUT: The number is 42 
    # or 
    say "The number is $/[0]";      # OUTPUT: The number is 42 
}

Pairs of parentheses are numbered left to right, starting from zero.

if 'abc' ~~ /(a) b (c)/ {
    say "0: $0; 1: $1";             # OUTPUT: «0: a; 1: c␤» 
}

The $0 and $1 etc. syntax is shorthand. These captures are canonically available from the match object $/ by using it as a list, so $0 is actually syntactic sugar for $/[0].

Coercing the match object to a list gives an easy way to programmatically access all elements:

if 'abc' ~~ /(a) b (c)/ {
    say $/.list.join: ', '  # OUTPUT: «a, c␤» 
}

Non-capturing grouping

The parentheses in regexes perform a double role: they group the regex elements inside and they capture what is matched by the sub-regex inside.

To get only the grouping behavior, you can use square brackets [ ... ] instead.

if 'abc' ~~ / [a||b] (c) / {
    say ~$0;                # OUTPUT: «c␤» 
}

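If you do not need the captures, using non-capturing [ ... ] groups has several benefits: they communicate the intent of the regex more cleanly, they make it easier to count the capturing groups that do matter, and they make matching a bit faster.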

Capture numbers

It is stated above that captures are numbered from left to right. While true in principle, this is also an oversimplification.

The following rules are listed for the sake of completeness. When you find yourself using them regularly, it's worth considering named captures (and possibly subrules) instead.

Alternations reset the capture count:

/ (x) (y)  || (a) (.) (.) /
# $0  $1      $0  $1  $2 

Example:

if 'abc' ~~ /(x)(y) || (a)(.)(.)/ {
    say ~$1;        # OUTPUT: «b␤» 
}

If two (or more) alternatives have a different number of captures, the one with the most captures determines the index of the next capture:

if 'abcd' ~~ / a [ b (.) || (x) (y) ] (.) / {
    #                 $0     $0  $1    $2 
    say ~$2;            # OUTPUT: «d␤» 
}

Captures can be nested, in which case they are numbered per level:

if 'abc' ~~ / ( a (.) (.) ) / {
    say "Outer: $0";                # OUTPUT: Outer: abc 
    say "Inner: $0[0] and $0[1]";   # OUTPUT: Inner: b and c 
}

If you need to refer to a capture from within another capture, store it in a variable first:

# !!WRONG!! The $0 refers to a capture *inside* the second capture 
say "11" ~~ /(\d) ($0)/; # OUTPUT: «Nil␤» 
 
# CORRECT: $0 is saved into a variable outside the second capture 
# before it is used inside (the `{}` is needed to update the current match) 
say "11" ~~ /(\d) {} :my $c = $0; ($c)/;
# OUTPUT: «「11」␤ 0 => 「1」␤ 1 => 「1」␤» 

Named captures

Instead of numbering captures, you can also give them names. The generic, and slightly verbose, way of naming captures is like this:

if 'abc' ~~ / $<myname> = [ \w+ ] / {
    say ~$<myname>      # OUTPUT: «abc␤» 
}

The access to the named capture, $<myname>, is a shorthand for indexing the match object as a hash, in other words: $/{ 'myname' } or $/<myname>.

Named captures can also be nested using regular capture group syntax:

if 'abc-abc-abc' ~~ / $<string>=( [ $<part>=[abc] ]* % '-' ) / {
    say ~$<string>;          # OUTPUT: «abc-abc-abc␤» 
    say ~$<string><part>;    # OUTPUT: «abc abc abc␤» 
    say ~$<string><part>[0]; # OUTPUT: «abc␤» 
}

Coercing the match object to a hash gives you easy programmatic access to all named captures:

if 'count=23' ~~ / $<variable>=\w+ '=' $<value>=\w+ / {
    my %h = $/.hash;
    say %h.keys.sort.join: ', ';        # OUTPUT: «value, variable␤» 
    say %h.values.sort.join: ', ';      # OUTPUT: «23, count␤» 
    for %h.kv -> $k, $v {
        say "Found value '$v' with key '$k'";
        # outputs two lines: 
        #   Found value 'count' with key 'variable' 
        #   Found value '23' with key 'value' 
    }
}

A more convenient way to get named captures is discussed in the Subrules section.

Capture markers: <( )>

A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint. The <( is similar to \K in other languages, which discards anything matched before it.

say 'abc' ~~ / a <( b )> c/;            # OUTPUT: «「b」␤» 
say 'abc' ~~ / <(a <( b )> c)>/;        # OUTPUT: «「bc」␤» 

As the example above shows, <( sets the start point and )> sets the endpoint. Since the two are independent of each other, the innermost start point wins (the one attached to b) and the outermost endpoint wins (the one attached to c).

Substitution

Regular expressions can also be used to substitute one piece of text for another. You can use this for anything, from correcting a spelling error (e.g., replacing 'Perl Jam' with 'Pearl Jam'), to reformatting an ISO8601 date from yyyy-mm-ddThh:mm:ssZ to mm-dd-yy h:m {AM,PM} and beyond.

Just like the search-and-replace dialog box of an editor, the s/ / / operator has two sides, a left and a right. The left side is where your matching expression goes, and the right side is what you want to replace it with.

Lexical conventions

Substitutions are written similarly to matching, but the substitution operator has both an area for the regex to match, and the text to substitute:

s/replace/with/;           # a substitution that is applied to $_ 
$str ~~ s/replace/with/;   # a substitution applied to a scalar 

The substitution operator allows delimiters other than the slash:

s|replace|with|;
s!replace!with!;
s,replace,with,;

Note that neither the colon : nor balancing delimiters such as {} or () can be substitution delimiters. Colons clash with adverbs such as s:i/Foo/bar/ and the other delimiters are used for other purposes.

If you use balancing brackets, the substitution works like this instead:

s[replace] = 'with';

The right-hand side is now a (not quoted) Perl 6 expression, in which $/ is available as the current match:

$_ = 'some 11 words 21';
s:g[ \d+ ] =  2 * $/;
.say;                    # OUTPUT: «some 22 words 42␤» 

Like the m// operator, whitespace is ignored in the regex part of a substitution. Comments, as in Perl 6 in general, start with the hash character # and go to the end of the current line.

Replacing string literals

The simplest thing to replace is a string literal. The string you want to replace goes on the left-hand side of the substitution operator, and the string you want to replace it with goes on the right-hand side; for example:

$_ = 'The Replacements';
s/Replace/Entrap/;
.say;                    # OUTPUT: «The Entrapments␤» 

Alphanumeric characters and the underscore are literal matches, just as in its cousin the m// operator. All other characters must be escaped with a backslash \ or included in quotes:

$_ = 'Space: 1999';
s/Space\:/Party like it's/;
.say                        # OUTPUT: «Party like it's 1999␤» 

Note that the matching restrictions only apply to the left-hand side of the substitution expression.

By default, substitutions are only done on the first match:

$_ = 'There can be twly two';
s/tw/on/;                     # replace 'tw' with 'on' once 
.say;                         # OUTPUT: «There can be only two␤» 

Wildcards and character classes

Anything that can go into the m// operator can go into the left-hand side of the substitution operator, including wildcards and character classes. This is handy when the text you're matching isn't static, such as trying to match a number in the middle of a string:

$_ = "Blake's 9";
s/\d+/7/;         # replace any sequence of digits with '7' 
.say;             # OUTPUT: «Blake's 7␤» 

Of course, you can use any of the +, * and ? modifiers, and they'll behave just as they would in the m// operator's context.

Capturing Groups

Just as in the match operator, capturing groups are allowed on the left-hand side, and the matched contents populate the $0..$n variables and the $/ object:

$_ = '2016-01-23 18:09:00';
s/ (\d+)\-(\d+)\-(\d+) /today/;   # replace YYYY-MM-DD with 'today' 
.say;                             # OUTPUT: «today 18:09:00␤» 
"$1-$2-$0".say;                   # OUTPUT: «01-23-2016␤» 
"$/[1]-$/[2]-$/[0]".say;          # OUTPUT: «01-23-2016␤» 

Any of these variables $0, $1, $/ can be used on the right-hand side of the operator as well, so you can manipulate what you've just matched. This way you can separate out the YYYY, MM and DD parts of a date and reformat them into MM-DD-YYYY order:

$_ = '2016-01-23 18:09:00';
s/ (\d+)\-(\d+)\-(\d+) /$1-$2-$0/;    # transform YYYY-MM-DD to MM-DD-YYYY 
.say;                                 # OUTPUT: «01-23-2016 18:09:00␤» 

Named capture can be used too:

$_ = '2016-01-23 18:09:00';
s/ $<y>=(\d+)\-$<m>=(\d+)\-$<d>=(\d+) /$<m>-$<d>-$<y>/;
.say;                                 # OUTPUT: «01-23-2016 18:09:00␤» 

Since the right-hand side is effectively a regular Perl 6 interpolated string, you can reformat the time from HH:MM to h:MM {AM,PM} like so:

$_ = '18:38';
s/(\d+)\:(\d+)/{$0 % 12}\:$1 {$0 < 12 ?? 'AM' !! 'PM'}/;
.say;                                 # OUTPUT: «6:38 PM␤» 

Using the modulo % operator above keeps the sample code under 80 characters, but is otherwise the same as $0 < 12 ?? $0 !! $0 - 12. When combined with the power of the Parsing Expression Grammars that really underlie what you're seeing here, you can use "regular expressions" to parse pretty much any text out there.

Common adverbs

The full list of adverbs that you can apply to regular expressions can be found elsewhere in this document (section Adverbs), but the most common are probably :g and :i.

Ordinarily, matches are only made once in a given string, but adding the :g modifier overrides that behavior, so that substitutions are made everywhere possible. Substitutions are non-recursive; for example:

$_ = q{I can say "banana" but I don't know when to stop};
s:g/na/nana,/;    # substitute 'nana,' for 'na' 
.say;             # OUTPUT: «I can say "banana,nana," but I don't ...␤» 

Here, na was found twice in the original string and each time there was a substitution. The substitution was applied only to the original string, though: the replacement text is not scanned again for further matches.

Ordinarily, matches are case-sensitive. s/foo/bar/ will only match 'foo' and not 'Foo'. If the adverb :i is used, though, matches become case-insensitive.

$_ = 'Fruit';
s/fruit/vegetable/;
.say;                          # OUTPUT: «Fruit␤» 
 
s:i/fruit/vegetable/;
.say;                          # OUTPUT: «vegetable␤» 

For more information on what these adverbs are actually doing, refer to the Adverbs section of this document.

These are just a few of the transformations you can apply with the substitution operator. Some of the simpler uses in the real world include removing personal data from log files, editing MySQL timestamps into PostgreSQL format, changing copyright information in HTML files and sanitizing form fields in a web application.

As an aside, novices to regular expressions often get overwhelmed and think that their regular expression needs to match every piece of data in the line, not just the part they are interested in. Write just enough to match the data you're looking for, no more, no less.

Tilde for nesting structures

The ~ operator is a helper for matching nested subrules with a specific terminator as the goal. It is designed to be placed between an opening and closing bracket, like so:

/ '(' ~ ')' <expression> /

However, it mostly ignores the left argument, and operates on the next two atoms (which may be quantified). Its operation on those next two atoms is to "twiddle" them so that they are actually matched in reverse order. Hence the expression above, at first blush, is merely shorthand for:

/ '(' <expression> ')' /

But beyond that, when it rewrites the atoms it also inserts the apparatus that will set up the inner expression to recognize the terminator, and to produce an appropriate error message if the inner expression does not terminate on the required closing atom. So it really does pay attention to the left bracket as well, and it actually rewrites our example to something more like:

$<OPEN> = '(' <SETGOAL: ')'> <expression> [ $GOAL || <FAILGOAL> ]

FAILGOAL is a special method that can be defined by the user and it will be called on parse failure:

grammar A { token TOP { '[' ~ ']' \w+  };
            method FAILGOAL($goal) {
                die "Cannot find $goal near position {self.pos}"
            }
}
 
say A.parse: '[good]';  # OUTPUT: «「[good]」␤» 
A.parse: '[bad';        # will throw FAILGOAL exception 
CATCH { default { put .^name, ': ', .Str } };
# OUTPUT: «X::AdHoc: Cannot find ']'  near position 4␤» 

Note that you can use this construct to set up expectations for a closing construct even when there's no opening bracket:

"3)"  ~~ / <?> ~ ')' \d+ /;  # RESULT: «「3)」» 
"(3)" ~~ / <?> ~ ')' \d+ /;  # RESULT: «「3)」» 

Here, <?> successfully matches the null string.

Captures are numbered in the order in which they appear in the regex source, not in the order in which they are matched:

"abc" ~~ /a ~ (c) (b)/;
say $0; # OUTPUT: «「c」␤» 
say $1; # OUTPUT: «「b」␤» 

Subrules

Just like you can put pieces of code into subroutines, you can also put pieces of regex into named rules.

my regex line { \N*\n }
if "abc\ndef" ~~ /<line> def/ {
    say "First line: ", $<line>.chomp;      # OUTPUT: «First line: abc␤» 
}

A named regex can be declared with my regex named-regex { body here }, and called with <named-regex>. At the same time, calling a named regex installs a named capture with the same name.

To give the capture a different name from the regex, use the syntax <capture-name=named-regex>. If no capture is desired, a leading dot will suppress it: <.named-regex>.
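As a minimal sketch of the three calling forms, the following uses a made-up grammar (Pair-Demo and its tokens are hypothetical names, invented only for this illustration) and shows which capture names end up in the match object:

```perl6
# Hypothetical grammar, used only to illustrate capture naming.
grammar Pair-Demo {
    token TOP  { <first=word> <.sep> <word> }
    token word { \w+ }
    token sep  { ',' }
}

my $m = Pair-Demo.parse('a,b');
say ~$m<first>;          # OUTPUT: «a␤»      (captured under the alias 'first')
say ~$m<word>;           # OUTPUT: «b␤»      (captured under the rule's own name)
say $m<sep>.defined;     # OUTPUT: «False␤»  (<.sep> matched, but left no capture)
```

The aliased call stores its result only under the alias, so the word key here refers to the second, unaliased call.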

Here's more complete code for parsing ini files:

my regex header { \s* '[' (\w+) ']' \h* \n+ }
my regex identifier  { \w+ }
my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
my regex section {
    <header>
    <kvpair>*
}
 
my $contents = q:to/EOI/; 
    [passwords]
        jack=password1
        joy=muchmoresecure123
    [quotas]
        jack=123
        joy=42
EOI
 
my %config;
if $contents ~~ /<section>*/ {
    for $<section>.list -> $section {
        my %section;
        for $section<kvpair>.list -> $p {
            %section{ $p<key> } = ~$p<value>;
        }
        %config{ $section<header>[0] } = %section;
    }
}
say %config.perl;
 
# OUTPUT: «{:passwords(${:jack("password1"), :joy("muchmoresecure123")}), 
#           :quotas(${:jack("123"), :joy("42")})}» 

Named regexes can and should be grouped in grammars. Predefined subrules are listed in S05-regex of the design documents.

Regex Interpolation

If you want to build a regex using a pattern given at runtime, regex interpolation is what you are looking for.

There are four ways to interpolate a string into a regex as a pattern: $pattern, $($pattern), <$pattern> and <{$pattern.method}>.

If the variable to be interpolated is statically typed as a Str or str (like $pattern0 and $pattern2 below) and is only interpolated literally (like the first example below), then the compiler can optimize the interpolation and it runs much faster.

my Str $text = 'camelia';
my Str $pattern0 = 'camelia';
my     $pattern1 = 'ailemac';
my str $pattern2 = '\w+';
 
say $text ~~ / $pattern0 /;                # OUTPUT: «「camelia」␤» 
say $text ~~ / $($pattern0) /;             # OUTPUT: «「camelia」␤» 
say $text ~~ / $($pattern1.flip) /;        # OUTPUT: «「camelia」␤» 
say 'ailemacxflip' ~~ / $pattern1.flip /;  # OUTPUT: «「ailemacxflip」␤» 
say '\w+' ~~ / $pattern2 /;                # OUTPUT: «「\w+」␤» 
say '\w+' ~~ / $($pattern2) /;             # OUTPUT: «「\w+」␤» 
 
say $text ~~ / <{$pattern1.flip}> /;       # OUTPUT: «「camelia」␤» 
# say $text ~~ / <$pattern1.flip> /;       # !!Compile Error!! 
say $text ~~ / <$pattern2> /;              # OUTPUT: «「camelia」␤» 
say $text ~~ / <{$pattern2}> /;            # OUTPUT: «「camelia」␤» 

Note that the first two syntaxes interpolate the string literally, while <$pattern> and <{$pattern.method}> cause an implicit EVAL, which is a known trap.
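The practical difference shows up as soon as the string contains regex metacharacters. The following is a small sketch with a made-up variable, $user-input, standing in for a pattern obtained at runtime:

```perl6
# $user-input is a hypothetical name for a string received at runtime.
my $user-input = 'a.c';

say so 'abc' ~~ / ^ $user-input $ /;     # OUTPUT: «False␤»  (the dot is matched literally)
say so 'a.c' ~~ / ^ $user-input $ /;     # OUTPUT: «True␤»
say so 'abc' ~~ / ^ <$user-input> $ /;   # OUTPUT: «True␤»   (the dot acts as a wildcard)
```

When the string comes from an untrusted source, the literal forms are the safer choice, since the string is never compiled as regex code.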

When an array variable is interpolated into a regex, the regex engine handles it like a | alternation of its elements. The interpolation rules for individual elements are the same as for scalars, so strings and numbers match literally, and Regex objects match as regexes. Just as with an ordinary | alternation, the longest match succeeds:

my @a = '2', 23, rx/a.+/;
say ('b235' ~~ /  b @a /).Str;      # OUTPUT: «b23» 

The use of hash variables in regexes is reserved.

Regex Boolean condition check

The special operator <?{}> allows the evaluation of a boolean expression that can perform a semantic check of the match before the regular expression continues. In other words, it is possible to check a part of a regular expression in a boolean context and thereby invalidate the whole match (or allow it to continue), even if the match succeeds from a syntactic point of view.

In particular, the <?{}> operator requires a True value in order to allow the regular expression to match, while its negated form <!{}> requires a False value.

In order to demonstrate the above operator, please consider the following example that involves a simple IPv4 address matching:

my $localhost = '127.0.0.1';
my regex ipv4-octet { \d ** 1..3 <?{ True }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: [「127」 「0」 「0」 「1」] 

The octet regular expression matches a number made of one to three digits. Each match is then subject to the result of <?{}>; since the block here always returns True, every syntactic match is accepted. As a counter-example, using the constant value False invalidates the match even though the regular expression matches from a syntactic point of view:

my $localhost = '127.0.0.1';
my regex ipv4-octet { \d ** 1..3 <?{ False }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: Nil 

From the above examples, it should be clear that it is possible to improve the semantic check, for instance ensuring that each octet is really a valid IPv4 octet:

my $localhost = '127.0.0.1';
my regex ipv4-octet { \d ** 1..3 <?{ $/.Int <= 255 && $/.Int >= 0 }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: [「127」 「0」 「0」 「1」] 

Note that the condition does not have to be written inline; a regular subroutine can be called to compute the boolean value:

my $localhost = '127.0.0.1';
sub check-octet ( Int $o ){ $o <= 255 && $o >= 0 }
my regex ipv4-octet { \d ** 1..3 <?{ &check-octet( $/.Int ) }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: [「127」 「0」 「0」 「1」] 

Of course, since <!{}> is the negated form of <?{}>, the same boolean evaluation can be rewritten in negated form:

my $localhost = '127.0.0.1';
sub invalid-octet( Int $o ){ $o < 0 || $o > 255 }
my regex ipv4-octet { \d ** 1..3 <!{ &invalid-octet( $/.Int ) }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: [「127」 「0」 「0」 「1」] 

Adverbs

Adverbs modify how regexes work and provide convenient shortcuts for certain kinds of recurring tasks.

There are two kinds of adverbs: regex adverbs apply at the point where a regex is defined and matching adverbs apply at the point that a regex matches against a string.

This distinction often blurs, because matching and declaration are often textually close, but using the method form of matching makes the distinction clear.

'abc' ~~ /../ is roughly equivalent to 'abc'.match(/../), or even more clearly written in separate lines:

my $regex = /../;           # definition 
if 'abc'.match($regex) {    # matching 
    say "'abc' has at least two characters";
}

Regex adverbs like :i go into the definition line and matching adverbs like :overlap are appended to the match call:

my $regex = /:i . a/;
for 'baA'.match($regex, :overlap) -> $m {
    say ~$m;
}
# OUTPUT: «ba␤aA␤» 

Regex Adverbs

Adverbs that appear at the time of a regex declaration are part of the actual regex and influence how the Perl 6 compiler translates the regex into binary code.

For example, the :ignorecase (:i) adverb tells the compiler to ignore the distinction between upper case, lower case and title case letters.

So 'a' ~~ /A/ is false, but 'a' ~~ /:i A/ is a successful match.

Regex adverbs can come before or inside a regex declaration and only affect the part of the regex that comes afterwards, lexically. Note that regex adverbs appearing before the regex must appear after something that introduces the regex to the parser, like 'rx' or 'm' or a bare '/'. This is NOT valid:

my $rx1 = :i/a/;      # adverb is before the regex is recognized => exception 

but these are valid:

my $rx1 = rx:i/a/;     # before 
my $rx2 = m:i/a/;      # before 
my $rx3 = /:i a/;      # inside 

These two regexes are equivalent:

my $rx1 = rx:i/a/;      # before 
my $rx2 = rx/:i a/;     # inside 

Whereas these two are not:

my $rx3 = rx/a :i b/;   # matches only the b case insensitively 
my $rx4 = rx/:i a b/;   # matches completely case insensitively 

Brackets and parentheses limit the scope of an adverb:

/ (:i a b) c /;         # matches 'ABc' but not 'ABC' 
/ [:i a b] c /;         # matches 'ABc' but not 'ABC' 

Ignoremark

The :ignoremark or :m adverb instructs the regex engine to only compare base characters, and ignore additional marks such as combining accents:

say so 'a' ~~ rx/ä/;                # OUTPUT: «False» 
say so 'a' ~~ rx:ignoremark /ä/;    # OUTPUT: «True» 
say so 'ỡ' ~~ rx:ignoremark /o/;    # OUTPUT: «True» 

Ratchet

The :ratchet or :r adverb causes the regex engine to not backtrack (see backtracking).

Without this adverb, parts of a regex will try different ways to match a string in order to make it possible for other parts of the regex to match. For example, in 'abc' ~~ /\w+ ./, the \w+ first eats up the whole string, abc, but then the . fails. Thus \w+ gives up a character, matching only ab, and the . can successfully match the string c. This process of giving up characters (or in the case of alternations, trying a different branch) is known as backtracking.

say so 'abc' ~~ / \w+ . /;        # OUTPUT: «True␤» 
say so 'abc' ~~ / :r \w+ . /;     # OUTPUT: «False␤» 

Ratcheting can be an optimization, because backtracking is costly. But more importantly, it closely corresponds to how humans parse a text. If you have a regex my regex identifier { \w+ } and my regex keyword { if | else | endif }, you intuitively expect the identifier to gobble up a whole word and not have it give up its end to the next rule, if the next rule otherwise fails.

For example, you don't expect the word motif to be parsed as the identifier mot followed by the keyword if. Instead, you expect motif to be parsed as one identifier; and if the parser expects an if afterwards, it is better for it to fail than to parse the input in a way you don't expect.

Since ratcheting behavior is often desirable in parsers, there's a shortcut to declaring a ratcheting regex:

my token thing { ... }
# short for 
my regex thing { :r ... }
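The motif scenario described above can be tried out with a short sketch. The names greedy-word, word and keyword are made up for this illustration; the first is declared with regex and may backtrack, the second with token and may not:

```perl6
# Hypothetical names, chosen only for this illustration.
my regex greedy-word { \w+ }                 # may give characters back
my token word        { \w+ }                 # ratchets: keeps what it matched
my token keyword     { if | else | endif }

say so 'motif' ~~ / ^ <greedy-word> <keyword> $ /;   # OUTPUT: «True␤»   (parsed as 'mot' + 'if')
say so 'motif' ~~ / ^ <word> <keyword> $ /;          # OUTPUT: «False␤»  ('motif' stays one word)
```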

Sigspace

The :sigspace or :s adverb makes whitespace significant in a regex.

say so "I used Photoshop®"   ~~ m:i/   photo shop /;      # OUTPUT: «True␤»
say so "I used a photo shop" ~~ m:i:s/ photo shop /;   # OUTPUT: «True␤»
say so "I used Photoshop®"   ~~ m:i:s/ photo shop /;   # OUTPUT: «False␤»

m:s/ photo shop / acts the same as m/ photo <.ws> shop <.ws> /. By default, <.ws> makes sure that words are separated, so a b and ^& will match <.ws> in the middle, but ab won't.
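A quick way to check these three cases is the following sketch, which relies only on the default <.ws> behavior described above:

```perl6
say so 'a b' ~~ m:s/ a b /;         # OUTPUT: «True␤»   (whitespace separates the words)
say so 'ab'  ~~ m:s/ a b /;         # OUTPUT: «False␤»  (no word boundary between a and b)
say so '^&'  ~~ m:s/ '^' '&' /;     # OUTPUT: «True␤»   (non-word characters need no separator)
```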

Where whitespace in a regex turns into <.ws> depends on what comes before the whitespace. In the above example, whitespace in the beginning of a regex doesn't turn into <.ws>, but whitespace after characters does. In general, the rule is that if a term might match something, whitespace after it will turn into <.ws>.

In addition, if whitespace comes after a term but before a quantifier (+, *, or ?), <.ws> will be matched after every match of the term. So, foo + becomes [ foo <.ws> ]+. On the other hand, whitespace after a quantifier acts as normal significant whitespace; e.g., "foo+" becomes foo+ <.ws>.

In all, this code:

rx :s {
    ^^
    {
        say "No sigspace after this";
    }
    <.assertion_and_then_ws>
    characters_with_ws_after+
    ws_separated_characters *
    [
    | some "stuff" .. .
    | $$
    ]
    :my $foo = "no ws after this";
    $foo
}

Becomes:

rx {
    ^^ <.ws>
    {
        say "No sigspace after this";
    }
    <.assertion_and_then_ws> <.ws>
    characters_with_ws_after+ <.ws>
    [ws_separated_characters <.ws>]* <.ws>
    [
    | some <.ws> "stuff" <.ws> .. <.ws> . <.ws>
    | $$ <.ws>
    ] <.ws>
    :my $foo = "no ws after this";
    $foo <.ws>
}

If a regex is declared with the rule keyword, both the :sigspace and :ratchet adverbs are implied.
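As a small sketch of the difference, the two declarations below have the same body but treat the space between a and b differently (t-demo and r-demo are made-up names):

```perl6
# Hypothetical names, used only to compare the two declarators.
my token t-demo { a b }     # whitespace in the body is ignored
my rule  r-demo { a b }     # whitespace in the body becomes <.ws>

say so 'ab'  ~~ / ^ <t-demo> $ /;   # OUTPUT: «True␤» 
say so 'a b' ~~ / ^ <t-demo> $ /;   # OUTPUT: «False␤» 
say so 'a b' ~~ / ^ <r-demo> $ /;   # OUTPUT: «True␤» 
say so 'ab'  ~~ / ^ <r-demo> $ /;   # OUTPUT: «False␤» 
```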

Grammars provide an easy way to override what <.ws> matches:

grammar Demo {
    token ws {
        <!ww>       # only match when not within a word 
        \h*         # only match horizontal whitespace 
    }
    rule TOP {      # called by Demo.parse; 
        a b '.'
    }
}
 
# doesn't parse, whitespace required between a and b 
say so Demo.parse("ab.");                 # OUTPUT: «False␤» 
say so Demo.parse("a b.");                # OUTPUT: «True␤» 
say so Demo.parse("a\tb .");              # OUTPUT: «True␤» 
 
# \n is vertical whitespace, so no match 
say so Demo.parse("a\tb\n.");             # OUTPUT: «False␤» 

When parsing file formats where some whitespace (for example, vertical whitespace) is significant, it's advisable to override ws.

Perl 5 compatibility adverb

The :Perl5 or :P5 adverb switches the Regex parsing and matching to the way Perl 5 regexes behave:

say so 'hello world' ~~ m:Perl5/^hello (world)/;   # OUTPUT: «True␤» 
say so 'hello world' ~~ m/^hello (world)/;         # OUTPUT: «False␤» 
say so 'hello world' ~~ m/^ 'hello ' ('world')/;   # OUTPUT: «True␤» 

The regular behavior is recommended and more idiomatic in Perl 6 of course, but the :Perl5 adverb can be useful when compatibility with Perl 5 is required.

Matching adverbs

In contrast to regex adverbs, which are tied to the declaration of a regex, matching adverbs only make sense when matching a string against a regex.

They can never appear inside a regex, only on the outside – either as part of an m/.../ match or as arguments to a match method.

Continue

The :continue or short :c adverb takes an argument. The argument is the position where the regex should start to search. By default, it searches from the start of the string, but :c overrides that. If no position is specified for :c, it will default to 0 unless $/ is set, in which case, it defaults to $/.to.

given 'a1xa2' {
    say ~m/a./;         # OUTPUT: «a1␤» 
    say ~m:c(2)/a./;    # OUTPUT: «a2␤» 
}

Note: unlike :pos, a match with :continue() will attempt to match further in the string, instead of failing:

say "abcdefg" ~~ m:c(3)/e.+/; # OUTPUT: «「efg」␤» 
say "abcdefg" ~~ m:p(3)/e.+/; # OUTPUT: «False␤» 

Exhaustive

To find all possible matches of a regex – including overlapping ones – and several ones that start at the same position, use the :exhaustive (short :ex) adverb.

given 'abracadabra' {
    for m:exhaustive/ a .* a / -> $match {
        say ' ' x $match.from, ~$match;
    }
}

The above code produces this output:

abracadabra
abracada
abraca
abra
   acadabra
   acada
   aca
     adabra
     ada
       abra

Global

Instead of searching for just one match and returning a Match object, search for every non-overlapping match and return them in a List. In order to do this, use the :global adverb:

given 'several words here' {
    my @matches = m:global/\w+/;
    say @matches.elems;         # OUTPUT: «3␤» 
    say ~@matches[2];           # OUTPUT: «here␤» 
}

:g is shorthand for :global.

Pos

Anchor the match at a specific position in the string:

given 'abcdef' {
    my $match = m:pos(2)/.*/;
    say $match.from;        # OUTPUT: «2␤» 
    say ~$match;            # OUTPUT: «cdef␤» 
}

:p is shorthand for :pos.

Note: unlike :continue, a match anchored with :pos() will fail, instead of attempting to match further down the string:

say "abcdefg" ~~ m:c(3)/e.+/; # OUTPUT: «「efg」␤» 
say "abcdefg" ~~ m:p(3)/e.+/; # OUTPUT: «False␤» 

Overlap

To get several matches, including overlapping matches, but only one (the longest) from each starting position, specify the :overlap (short :ov) adverb:

given 'abracadabra' {
    for m:overlap/ a .* a / -> $match {
        say ' ' x $match.from, ~$match;
    }
}

produces

abracadabra
   acadabra
     adabra
       abra

Substitution Adverbs

You can apply matching adverbs (such as :global, :pos etc.) to substitutions. In addition, there are adverbs that only make sense for substitutions, because they transfer a property from the matched string to the replacement string.

Samecase

The :samecase or :ii substitution adverb implies the :ignorecase adverb for the regex part of the substitution, and in addition carries the case information to the replacement string:

$_ = 'The cat chases the dog';
s:global:samecase[the] = 'a';
say $_;                 # OUTPUT: «A cat chases a dog» 

Here you can see that the first replacement string a got capitalized, because the first letter of the matched string was also a capital letter.

Samemark

The :samemark or :mm adverb implies :ignoremark for the regex, and in addition, copies the markings from the matched characters to the replacement string:

given 'äộñ' {
    say S:mm/ a .+ /uia/;           # OUTPUT: «üị̂ã» 
}

Samespace

The :samespace or :ss substitution modifier implies the :sigspace modifier for the regex, and in addition, copies the whitespace from the matched string to the replacement string:

say S:samespace/a ./c d/.perl given "a b";      # OUTPUT: «"c d"» 
say S:samespace/a ./c d/.perl given "a\tb";     # OUTPUT: «"c\td"» 
say S:samespace/a ./c d/.perl given "a\nb";     # OUTPUT: «"c\nd"» 

The ss/.../.../ syntactic form is a shorthand for s:samespace/.../.../.

Best practices and gotchas

To help with robust regexes and grammars, here are some best practices for code layout and readability, what to actually match, and avoiding common pitfalls.

Code layout

Without the :sigspace adverb, whitespace is not significant in Perl 6 regexes. Use that to your own advantage and insert whitespace where it increases readability. Also, insert comments where necessary.

Compare the very compact

my regex float { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? }

to the more readable

my regex float {
     <[+-]>?        # optional sign 
     \d*            # leading digits, optional 
     '.'
     \d+
     [              # optional exponent 
        e <[+-]>?  \d+
     ]?
}

As a rule of thumb, use whitespace around atoms and inside groups; put quantifiers directly after the atom; and vertically align opening and closing brackets and parentheses.

When you use a list of alternations inside parentheses or brackets, align the vertical bars:

my regex example {
    <preamble>
    [
    || <choice_1>
    || <choice_2>
    || <choice_3>
    ]+
    <postamble>
}

Keep it small

Regexes are often more compact than regular code. Because they do so much with so little, keep regexes short.

When you can name a part of a regex, it's usually best to put it into a separate, named regex.

For example, you could take the float regex from earlier:

my regex float {
     <[+-]>?        # optional sign 
     \d*            # leading digits, optional 
     '.'
     \d+
     [              # optional exponent 
        e <[+-]>?  \d+
     ]?
}

And decompose it into parts:

my token sign { <[+-]> }
my token decimal { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
    <sign>?
    <decimal>?
    '.'
    <decimal>
    <exponent>?
}

That helps, especially when the regex becomes more complicated. For example, you might want to make the decimal point optional in the presence of an exponent.

my regex float {
    <sign>?
    [
    || <decimal>?  '.' <decimal> <exponent>?
    || <decimal> <exponent>
    ]
}
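To see what the decomposed version accepts, it can be run against a few sample strings. The following sketch repeats the declarations so that it is self-contained; the sample inputs are made up for this illustration:

```perl6
my token sign     { <[+-]> }
my token decimal  { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
    <sign>?
    [
    || <decimal>?  '.' <decimal> <exponent>?
    || <decimal> <exponent>
    ]
}

# Anchor with ^ and $ so that the whole string has to be a float.
for '3.14', '-.5', '1e10', '+2.5e-3', '42' -> $candidate {
    say $candidate, ' => ', so $candidate ~~ / ^ <float> $ /;
}
# OUTPUT: «3.14 => True␤-.5 => True␤1e10 => True␤+2.5e-3 => True␤42 => False␤» 
```

A plain integer such as 42 is rejected, because it has neither a decimal point nor an exponent.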

What to match

Often the input data format has no clear-cut specification, or the specification is not known to the programmer. Then, it's good to be liberal in what you expect, but only so long as there are no possible ambiguities.

For example, in ini files:

[section]
key=value

What can be inside the section header? Allowing only a word might be too restrictive. Somebody might write [two words], or use dashes, or so on. Instead of asking what's allowed on the inside, it might be worth asking instead: what's not allowed?

Clearly, closing brackets are not allowed, because [a]b] would be ambiguous. By the same argument, opening brackets should be forbidden. This leaves us with

token header { '[' <-[ \[\] ]>+ ']' }

which is fine if you are only processing one line. But if you're processing a whole file, suddenly the regex parses

[with a
newline in between]

which might not be a good idea. A compromise would be

token header { '[' <-[ \[\] \n ]>+ ']' }

and then, in the post-processing, strip leading and trailing spaces and tabs from the section header.
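A minimal sketch of that approach follows. It assumes a capturing group is added around the character class so that the header text can be retrieved; that group is an addition for this illustration and not part of the token shown above:

```perl6
# Same pattern as above, with parentheses added to capture the header text.
my token header { '[' ( <-[ \[ \] \n ]>+ ) ']' }

if '[ two words ]' ~~ / ^ <header> $ / {
    # Post-processing: strip leading and trailing whitespace.
    say $<header>[0].Str.trim.perl;               # OUTPUT: «"two words"␤» 
}

say so "[with a\nnewline]" ~~ / ^ <header> $ /;   # OUTPUT: «False␤» 
```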

Matching Whitespace

The :sigspace adverb (or using the rule declarator instead of token or regex) is very handy for implicitly parsing whitespace that can appear in many places.

Going back to the example of parsing ini files, we have

my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }

which is probably not as liberal as we want it to be, since the user might put spaces around the equals sign. So, then we may try this:

my regex kvpair { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ }

But that's looking unwieldy, so we try something else:

my rule kvpair { <key=identifier> '=' <value=identifier> \n+ }

But wait! The implicit whitespace matching after the value uses up all whitespace, including newline characters, so the \n+ doesn't have anything left to match (and rule also disables backtracking, so no luck there).

Therefore, it's important to redefine implicit whitespace so that it covers only whitespace that is not significant in the input format.

This works by redefining the token ws; however, it only works for grammars:

grammar IniFormat {
    token ws { <!ww> \h* }
    rule header { \s* '[' (\w+) ']' \n+ }
    token identifier  { \w+ }
    rule kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
    token section {
        <header>
        <kvpair>*
    }
 
    token TOP {
        <section>*
    }
}
 
my $contents = q:to/EOI/; 
    [passwords]
        jack = password1
        joy = muchmoresecure123
    [quotas]
        jack = 123
        joy = 42
EOI
say so IniFormat.parse($contents);

Besides putting all regexes into a grammar and turning them into tokens (because they don't need to backtrack anyway), the interesting new bit is

token ws { <!ww> \h* }

which gets called for implicit whitespace parsing. It succeeds only when the current position is not between two word characters (<!ww>, a negated "within word" assertion), and then matches zero or more horizontal whitespace characters. The limitation to horizontal whitespace is important, because newlines (which are vertical whitespace) delimit records and shouldn't be matched implicitly.

Still, there's some whitespace-related trouble lurking. The regex \n+ won't match a string like "\n \n", because there's a blank between the two newlines. To allow such input strings, replace \n+ with \n\s*.
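The difference is easy to check in isolation. The sketch below uses a made-up input in which a blank sits between two newlines:

```perl6
say so "a\n \nb" ~~ / a \n+ b /;       # OUTPUT: «False␤»  (the blank stops \n+)
say so "a\n \nb" ~~ / a \n\s* b /;     # OUTPUT: «True␤»   (\s* also consumes the blank)
```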

Backtracking

Perl 6 defaults to backtracking when evaluating regular expressions. Backtracking is a technique that allows the engine to try different ways of matching so that every part of a regular expression can succeed. This is costly, because the engine usually consumes as much as possible in the first attempt and then has to give characters back, step by step, so that the remaining parts of the regular expression get a chance to match.

In order to better understand backtracking, consider the following example:

my $string = 'PostgreSQL is an SQL database!';
say $string ~~ /(.+)(SQL) (.+) $1/; # OUTPUT: 「PostgreSQL is an SQL」 

What happens in the above example is that the string has to be matched against the second occurrence of the word SQL, eating all characters before and leaving out the rest.

Since it is possible to execute a piece of code within a regular expression, it is also possible to inspect the Match object within the regular expression itself:

my $iteration = 0;
sub show-captures( Match $m ){
    my Str $result_split;
    say "\n=== Iteration {++$iteration} ===";
    for $m.list.kv -> $i, $capture {
        say "Capture $i = $capture";
        $result_split ~= '[' ~ $capture ~ ']';
    }
 
    say $result_split;
}
 
$string ~~ /(.+)(SQL) (.+) $1 .+ { show-captures( $/ );  }/;

The show-captures sub will dump all the elements of $/, producing the following output:

=== Iteration 1 ===
Capture 0 = Postgre
Capture 1 = SQL
Capture 2 =  is an
[Postgre][SQL][ is an ]

showing that the string has been split around the second occurrence of SQL, that is, the repetition of the second capture ($/[1]).

With that in place, it is now possible to see how the engine backtracks to find the above match: it suffices to move the call to show-captures into the middle of the regular expression, in particular before the repetition of the capture $1, to see it in action:

my $iteration = 0;
sub show-captures( Match $m ){
    my Str $result-split;
    say "\n=== Iteration {++$iteration} ===";
    for $m.list.kv -> $i, $capture {
        say "Capture $i = $capture";
        $result-split ~= '[' ~ $capture ~ ']';
    }
 
    say $result-split;
}
 
$string ~~ / (.+)(SQL) (.+) { show-captures( $/ );  } $1 /;

The output will be much more verbose and will show several iterations, with the last one being the winning one. The following is an excerpt of the output:

=== Iteration 1 ===
Capture 0 = PostgreSQL is an
Capture 1 = SQL
Capture 2 =  database!
[PostgreSQL is an ][SQL][ database!]
 
=== Iteration 2 ===
Capture 0 = PostgreSQL is an
Capture 1 = SQL
Capture 2 =  database
[PostgreSQL is an ][SQL][ database]
 
...
 
=== Iteration 24 ===
Capture 0 = Postgre
Capture 1 = SQL
Capture 2 =  is an
[Postgre][SQL][ is an ]

In the first iteration, the SQL part of PostgreSQL is swallowed by the first capture: that is not what the regular expression asks for, so another iteration is needed. The second iteration moves back one character (dropping the final !) and tries to match again, which fails because the SQL is still kept within PostgreSQL. After several iterations, the engine finally finds the match.

It is worth noting that the final iteration is number 24, which is one more than the distance, in number of characters, from the end of the string to the first occurrence of SQL:

say $string.chars - $string.index: 'SQL'; # OUTPUT: 23 

Since there are 23 characters from the very end of the string to the very first S of SQL, the backtracking engine needs 23 "useless" attempts before it finds the right one; that is, it needs 24 steps to get the final result.

Backtracking is costly, so it is possible to disable it in those cases where a match can be found by moving forward only.

With regard to the above example, disabling backtracking means the regular expression has no chance to match:

say $string ~~ /(.+)(SQL) (.+) $1/;      # OUTPUT: 「PostgreSQL is an SQL」 
say $string ~~ / :r (.+)(SQL) (.+) $1/;  # OUTPUT: Nil 

The reason is that, as shown in the output of iteration 1, the engine's first attempt splits the string into "PostgreSQL is an ", "SQL" and " database!", which leaves no room for matching another occurrence of the word SQL (the $1 in the regular expression). Since the engine cannot go back and try a different path, the match fails.
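
The effect of :r can also be seen on a much smaller input. The following is a minimal sketch that is separate from the PostgreSQL example; the string 'abc' and the variable names are chosen only for illustration:

```raku
# Illustrative input, not part of the PostgreSQL example.
my $input = 'abc';

# With backtracking, \w+ first takes 'abc', then gives back 'c'
# so that the final dot has something to match.
my $with-backtracking = so $input ~~ / \w+ . /;

# With :r, \w+ keeps everything it has taken, so nothing is left
# for the final dot, whichever start position is tried.
my $without-backtracking = so $input ~~ / :r \w+ . /;

say $with-backtracking;       # OUTPUT: True
say $without-backtracking;    # OUTPUT: False
```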

It is worth noting that disabling backtracking does not prevent the engine from trying several ways to match the regular expression. Consider the following slightly changed example:

my $string = 'PostgreSQL is an SQL database!';
say $string ~~ / (SQL) (.+) $1 /; # OUTPUT: Nil 

Since nothing is required before the word SQL, the engine will match the leftmost occurrence of SQL and go forward from there. Since there is no repetition of SQL remaining, the match fails. It is possible, again, to inspect what the engine does by adding a piece of code that dumps the captures within the regular expression:

my $iteration = 0;
sub show-captures( Match $m ){
    my Str $result-split;
    say "\n=== Iteration {++$iteration} ===";
    for $m.list.kv -> $i, $capture {
        say "Capture $i = $capture";
        $result-split ~= '[' ~ $capture ~ ']';
    }
 
    say $result-split;
}
 
$string ~~ / (SQL) (.+) { show-captures( $/ ); } $1 /;

which produces a rather simple output:

=== Iteration 1 ===
Capture 0 = SQL
Capture 1 =  is an SQL database!
[SQL][ is an SQL database!]
 
=== Iteration 2 ===
Capture 0 = SQL
Capture 1 =  database!
[SQL][ database!]

Even using the :r adverb to prevent backtracking will not change things:

my $iteration = 0;
sub show-captures( Match $m ){
    my Str $result-split;
    say "\n=== Iteration {++$iteration} ===";
    for $m.list.kv -> $i, $capture {
        say "Capture $i = $capture";
        $result-split ~= '[' ~ $capture ~ ']';
    }
 
    say $result-split;
}
 
$string ~~ / :r (SQL) (.+) { show-captures( $/ ); } $1 /;

and the output will remain the same:

=== Iteration 1 ===
Capture 0 = SQL
Capture 1 =  is an SQL database!
[SQL][ is an SQL database!]
 
=== Iteration 2 ===
Capture 0 = SQL
Capture 1 =  database!
[SQL][ database!]

This demonstrates that disabling backtracking does not stop the matching engine from making multiple attempts; it only stops the engine from going backward to adjust an earlier part of the match.
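
A smaller sketch of the same behavior, assuming (as in the examples above) that an embedded code block runs every time the engine reaches it. The input 'aab' and the counter $attempts are hypothetical names used only for illustration:

```raku
my $attempts = 0;

# :r forbids going backward inside an attempt, but the engine may
# still restart the whole pattern at a later position in the string.
my $match = 'aab' ~~ / :r a { $attempts++ } b /;

say $match;         # OUTPUT: 「ab」
say $match.from;    # OUTPUT: 1
say $attempts;      # OUTPUT: 2
```

The first attempt starts at position 0, matches a, and fails on b; the second attempt starts at position 1 and succeeds. No backtracking is involved in either attempt.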

$/ changes each time a regular expression is matched

It is worth noting that each time a regular expression is used, the Match object returned (i.e., $/) is reset. In other words, $/ always refers to the very last regular expression matched:

my $answer = 'a lot of Stuff';
say 'Hit a capital letter!' if $answer ~~ / <[A..Z]> /;
say $/;  # OUTPUT: 「S」 
say 'hit an x!' if $answer ~~ / x /;
say $/;  # OUTPUT: Nil 

The reset of $/ applies regardless of the scope in which the regular expression is matched:

my $answer = 'a lot of Stuff';
if $answer ~~ / <[A..Z]> / {
   say 'Hit a capital letter';
   say $/;  # OUTPUT: 「S」 
}
say $/;  # OUTPUT: 「S」 
 
if True {
  say 'hit an x!' if $answer ~~ / x /;
  say $/;  # OUTPUT: Nil 
}
 
say $/;  # OUTPUT: Nil 

The very same concept applies to named captures:

my $answer = 'a lot of Stuff';
if $answer ~~ / $<capital>=<[A..Z]> / {
   say 'Hit a capital letter';
   say $/<capital>; # OUTPUT: 「S」 
}
 
say $/<capital>;    # OUTPUT: 「S」 
say 'hit an x!' if $answer ~~ / $<x>=x /;
say $/<x>;          # OUTPUT: Nil 
say $/<capital>;    # OUTPUT: Nil
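
Because $/ is overwritten by every match, a result that is needed later has to be saved before the next match runs. A minimal sketch, reusing the $answer string from above; the variable $capital-match is a name chosen only for this illustration:

```raku
my $answer = 'a lot of Stuff';

# The smartmatch returns the Match object, so it can be stored directly.
my $capital-match = $answer ~~ / $<capital>=<[A..Z]> /;

# This failing match resets $/ to Nil...
say 'hit an x!' if $answer ~~ / x /;
say $/;                        # OUTPUT: Nil

# ...but the saved Match object is untouched.
say $capital-match<capital>;   # OUTPUT: 「S」
```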