Unicode support in Perl 6
Perl 6 has a high level of support for Unicode. This document aims both to give an overview and to describe Unicode features that don't belong in the documentation for individual routines and methods.
For an overview of MoarVM's internal representation of strings, see the MoarVM string documentation.
Perl 6 applies normalization by default to all input and output except for file names, which are stored as UTF8-C8; graphemes, which are user-visible forms of the characters, will use a normalized representation. What does this mean? For example, the grapheme á can be represented in two ways, either using one codepoint:
á (U+E1 "LATIN SMALL LETTER A WITH ACUTE")
Or two codepoints:
a + ́ (U+61 "LATIN SMALL LETTER A" + U+301 "COMBINING ACUTE ACCENT")
Perl 6 will turn both of these inputs into one codepoint, as specified for Normalization Form C (NFC). In most cases this is useful, since it means that two equivalent inputs are treated the same. Unicode has a concept of canonical equivalence, which lets us determine the canonical form of a string so that strings can be compared and manipulated properly without the text losing these properties. By default, any text you process or output from Perl 6 will be in this "canonical" form, even after modifying or concatenating strings (see below for how to avoid this). For more detailed information about Normalization Form C and canonical equivalence, see the Unicode Consortium's page on Normalization and Canonical Equivalence.
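As a minimal sketch of what this normalization means in practice (the variable names are illustrative only), the two spellings of á compare equal once they are Perl 6 strings, and the decomposed form is only visible again when you explicitly ask for NFD:

```perl6
my $composed   = "\x[E1]";     # á as one codepoint
my $decomposed = "a\x[301]";   # a followed by COMBINING ACUTE ACCENT

# Both strings are normalized, so they are the same string
say $composed eq $decomposed;  # OUTPUT: «True␤»

# The two-codepoint input is one grapheme, stored as one codepoint
say $decomposed.chars;         # OUTPUT: «1␤»
say $decomposed.ords;          # OUTPUT: «(225)␤»

# Asking for Normalization Form D shows the decomposed codepoints
say $decomposed.NFD.list;      # OUTPUT: «(97 769)␤»
```

Here 225 is U+E1, while 97 and 769 are U+61 and U+301.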
One case where we don't default to this is for the names of files. This is because file names must be accessed exactly as the bytes are written on the disk.
To avoid normalization you can use a special encoding format called UTF8-C8. Using this encoding with any filehandle lets you read the exact bytes as they are on disk, without normalization. The text may look odd if you print it to a handle that uses UTF-8. If you print it to a handle whose output encoding is UTF8-C8, it will render as you would normally expect and be a byte-for-byte exact copy. More technical details on UTF8-C8 on MoarVM follow below.
UTF-8 Clean-8 is an encoder/decoder that mostly works like the UTF-8 one. However, upon encountering a byte sequence that either will not decode as valid UTF-8 or would not round-trip due to normalization, it uses NFG synthetics to keep track of the original bytes involved. This means that encoding back to UTF-8 Clean-8 can recreate the bytes as they originally existed. Each synthetic contains 4 codepoints:
The codepoint 0x10FFFD (which is a private use codepoint)
The codepoint 'x'
The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)
The lower 4 bits of the non-decodable byte as a hex char (0..9A..F)
Under normal UTF-8 encoding, this means the unrepresentable characters will come out as something like
UTF-8 Clean-8 is used in places where MoarVM receives strings from the environment, command line arguments, and filesystem queries, for instance when decoding buffers:
say Buf.new(ord('A'), 0xFE, ord('Z')).decode('utf8-c8');
# OUTPUT: «AxFEZ␤»
You can see how the two initial codepoints used by UTF8-C8 show up here, right before the "FE". You can use this type of encoding to read files with unknown encoding:
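# Example path; any writable location works
my $test-file = "/tmp/test";
given open($test-file, :w, :bin) {
    # Write a mix of ASCII letters and bytes that are not valid UTF-8
    .write: Buf.new(ord('A'), 0xFA, ord('B'), 0xFB, 0xFC, ord('C'), 0xFD);
    .close;
}
# Decode as UTF8-C8, then encode back to recover the original bytes
say slurp($test-file, enc => 'utf8-c8').encode('utf8-c8').list;
# OUTPUT: «(65 250 66 251 252 67 253)␤»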
Reading with this encoding and then encoding the result back to UTF8-C8 gives you back the original bytes; this would not have been possible with the default UTF-8 encoding.
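The same round trip can be sketched in memory, without a file. This is only an illustration: the variable names are made up for the example, and it assumes Rakudo on MoarVM. It contrasts strict UTF-8 decoding, which rejects the malformed byte, with UTF8-C8, which preserves it:

```perl6
# 0xFE can never appear in valid UTF-8
my $bytes = Buf.new(ord('A'), 0xFE, ord('Z'));

# Strict UTF-8 decoding throws on the malformed byte;
# `try` turns the failure into an undefined value
my $strict = try $bytes.decode('utf8');
say $strict.defined;                  # OUTPUT: «False␤»

# UTF8-C8 keeps the bad byte as a single synthetic grapheme
my $text = $bytes.decode('utf8-c8');
say $text.chars;                      # OUTPUT: «3␤»

# Encoding back to UTF8-C8 restores the original bytes
my $back = $text.encode('utf8-c8');
say $back.list;                       # OUTPUT: «(65 254 90)␤»
```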
Please note that this encoding so far is not supported in the JVM implementation of Rakudo.
You can enter Unicode codepoints by number (decimal as well as hexadecimal). For example, the character named "latin capital letter ae with macron" has decimal codepoint 482 and hexadecimal codepoint 0x1E2:
say "\c[482]"; # OUTPUT: «Ǣ␤»
say "\x1E2"; # OUTPUT: «Ǣ␤»
say "\c[PENGUIN]"; # OUTPUT: «🐧␤»
say "\c[BELL]"; # OUTPUT: «🔔␤» (U+1F514 BELL)
All Unicode codepoint names, named sequences, and emoji sequences are now case-insensitive: [Starting in 2017.02]
say "\c[latin capital letter ae with macron]"; # OUTPUT: «Ǣ␤»
say "\c[latin capital letter E]"; # OUTPUT: «E␤» (U+0045)
You can specify multiple characters by using a comma-separated list with \c. You can combine numeric and named styles as well:
say "\c[482,PENGUIN]"; # OUTPUT: «Ǣ🐧␤»
In addition to using \c inside interpolated strings, you can also use uniparse:
say "DIGIT ONE".uniparse; # OUTPUT: «1␤»
say uniparse("DIGIT ONE"); # OUTPUT: «1␤»
Codepoints can also be entered by name alias. Name Aliases are used mainly for codepoints without an official name, for abbreviations, or for corrections (Unicode names never change). For a full list of them, see the Unicode Character Database file NameAliases.txt.
Control codes without any official name:
say "\c[ALERT]"; # Not visible (U+0007 control code (also accessible as \a))
say "\c[LINE FEED]"; # Not visible (U+000A same as "\n")
Corrections:
say "\c[LATIN CAPITAL LETTER GHA]"; # OUTPUT: «Ƣ␤»
say "Ƣ".uniname; # OUTPUT: «LATIN CAPITAL LETTER OI␤»
# This one is a spelling mistake that was corrected in a Name Alias:
say "\c[PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET]".uniname;
# OUTPUT: «PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET␤»
Abbreviations:
say "\c[ZWJ]".uniname; # OUTPUT: «ZERO WIDTH JOINER␤»
say "\c[NBSP]".uniname; # OUTPUT: «NO-BREAK SPACE␤»
You can also use any of the Named Sequences. These are not single codepoints, but sequences of them. [Starting in 2017.02]
say "\c[LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW AND ACUTE]"; # OUTPUT: «É̩␤»
say "\c[LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW AND ACUTE]".ords; # OUTPUT: «(201 809)␤»
Rakudo has support for Emoji 4.0 (the latest non-draft release) sequences. For all of them see: Emoji ZWJ Sequences and Emoji Sequences. Note that any names with commas should have their commas removed, since Perl 6 uses commas to separate different codepoints/sequences inside the same \c sequence.
say "\c[woman gesturing OK]"; # OUTPUT: «🙆♀️␤»
say "\c[family: man woman girl boy]"; # OUTPUT: «👨👩👧👦␤»