Unicode Support in Perl 6

Perl 6 has a high level of support of Unicode. This document aims to be both an overview as well as describe Unicode features which don't belong in the documentation for routines and methods.

For an overview on MoarVM's internal representation of strings, see the MoarVM string documentation.

File Handles and I/O

Perl6 applies normalization by default to all input and output except for file names which are stored as UTF8-C8. What does this mean? For example á can be represented 2 ways. Either using one codepoint:


Or two codepoints:


Perl 6 will turn both these inputs into one codepoint, as is specified for normalization form canonical (NFC). In most cases this is useful and means that two inputs that are equivalent both are treated the same, and any text you process or output from Perl 6 will be in this "canonical" form.

One case where we don't default to this, is for file handles. This is because file handles must be accessed exactly as the bytes are written on the disk.

You can use UTF8-C8 with any file handle to read the exact bytes as they are on disk. They may look funny when printed out, if you print it out using a UTF8 handle. If you print it out to a handle where the output is UTF8-C8, then it will render as you would normally expect, and be a byte for byte exact copy. More technical details on UTF8-C8 on MoarVM below.


UTF-8 Clean-8 is an encoder/decoder that primarily works as the UTF-8 one. However, upon encountering a byte sequence that will either not decode as valid UTF-8, or that would not round-trip due to normalization, it will use NFG synthetics to keep track of the original bytes involved. This means that encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they originally existed. The synthetics contain 4 codepoints:

Under normal UTF-8 encoding, this means the unrepresentable characters will come out as something like `?xFF`.

UTF-8 Clean-8 is used in places where MoarVM receives strings from the environment, command line arguments, and file system queries.

Entering Unicode Codepoints and Codepoint Sequences

You can enter Unicode codepoints by number (decimal as well as hexadecimal). For example, the character named "latin capital letter ae with macron" has decimal codepoint 482 and hexadecimal codepoint 0x1E2:

say "\c[482]"# OUTPUT: «Ǣ␤» 
say "\x1E2";   # OUTPUT: «Ǣ␤» 

You can also access Unicode codepoints by name: Rakudo supports all Unicode 9.0 names.

say "\c[PENGUIN]"# OUTPUT: «🐧␤» 
say "\c[BELL]";    # OUTPUT: «🔔␤» (U+1F514 BELL) 

All Unicode codepoint names/named seq/emoji sequences are now case-insensitive: [Starting in 2017.02]

say "\c[latin capital letter ae with macron]"# OUTPUT: «Ǣ␤» 
say "\c[latin capital letter E]";              # OUTPUT: «E␤» (U+0045) 

You can specify multiple characters by using a comma separated list with \c[]. You can combine numeric and named styles as well:

say "\c[482,PENGUIN]"# OUTPUT: «Ǣ🐧␤» 

In addition to using \c[] inside interpolated strings, you can also use the uniparse:

say "DIGIT ONE".uniparse;  # OUTPUT: «1␤» 
say uniparse("DIGIT ONE"); # OUTPUT: «1␤» 

Name Aliases

By name alias. Name Aliases are used mainly for codepoints without an official name, for abbreviations, or for corrections (Unicode names never change). For full list of them see here.

Control codes without any official name:

say "\c[ALERT]";     # Not visible (U+0007 control code (also accessible as \a)) 
say "\c[LINE FEED]"# Not visible (U+000A same as "\n") 


# This one is a spelling mistake that was corrected in a Name Alias: 


say "\c[ZWJ]".uniname;  # OUTPUT: «ZERO WIDTH JOINER␤» 
say "\c[NBSP]".uniname# OUTPUT: «NO-BREAK SPACE␤» 

Named Sequences

You can also use any of the Named Sequences, these are not single codepoints, but sequences of them. [Starting in 2017.02]


Emoji Sequences

Rakudo has support for Emoji 4.0 (the latest non-draft release) sequences. For all of them see: Emoji ZWJ Sequences and Emoji Sequences. Note that any names with commas should have their commas removed, since Perl 6 uses commas to separate different codepoints/sequences inside the same \c sequence.

say "\c[woman gesturing OK]";         # OUTPUT: «🙆‍♀️␤» 
say "\c[family: man woman girl boy]"# OUTPUT: «👨‍👩‍👧‍👦␤»