UTF-8 is the encoding used to store Unicode codepoints, and almost every website transmits its text in UTF-8. At a low level, UTF-8 takes a 21-bit integer (a codepoint) and turns it into a sequence of 8-bit integers (bytes) using the following scheme:
Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|
0x00-0x7F | 0xxxxxxx | |||
0x80-0x7FF | 110xxxxx | 10xxxxxx | ||
0x800-0xFFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
0x10000-0x1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
(There are some integers that are not allowed to be encoded by the Unicode standard, but that’s not important right now.)
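As a sketch of the table, here is a minimal single-codepoint encoder in Python; the function name `utf8_encode` is mine, and the validity checks hinted at above (surrogates, the 0x10FFFF cap) are deliberately omitted.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single codepoint per the table above (validity checks omitted)."""
    if cp <= 0x7F:
        return bytes([cp])                          # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode(0x20AC) == "€".encode("utf-8")   # 0xE2 0x82 0xAC
assert utf8_encode(0x1F600) == "😀".encode("utf-8")  # 0xF0 0x9F 0x98 0x80
```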
Why use UTF-8? There are a number of text-encoding reasons to prefer it (backwards compatibility with ASCII, for one), but the reason I care about is self-synchronization: even if bytes are deleted from the stream, a decoder can still find and decode every valid, untouched UTF-8 sequence, because continuation bytes can never be mistaken for leading bytes.
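A quick way to see self-synchronization in action is Python's built-in decoder: delete a byte from the middle of a valid stream, and only the damaged sequence is lost, not everything after it.

```python
data = "héllo".encode("utf-8")     # b'h\xc3\xa9llo'
damaged = data[:1] + data[2:]      # drop the leading byte 0xC3
# The orphaned continuation byte 0xA9 is rejected, but the decoder
# resynchronizes immediately afterwards and recovers "llo".
print(damaged.decode("utf-8", errors="replace"))   # 'h\ufffdllo'
```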
The only real peculiarity of UTF-8 decoding is overlong encodings: distinct byte sequences that decode to the same codepoint (`0xC0 0x80` and `0x00` both encode U+0000). This can be a security issue if UTF-8 text is compared byte by byte without decoding the sequences first. The ambiguity is easily fixed by specifying that the only valid encoding is the shortest one.
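Strict decoders already enforce the shortest-sequence rule; Python's, for instance, rejects the overlong pair outright:

```python
# 0xC0 0x80 is an overlong (two-byte) encoding of U+0000.
try:
    b"\xc0\x80".decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)    # strict decoders refuse overlong forms

# The canonical (shortest) encoding of U+0000 is the single byte 0x00.
assert "\x00".encode("utf-8") == b"\x00"
```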
There is nothing inherent to the UTF-8 encoding scheme that forbids further extension:
Range | Byte 1 | # Continue Bytes |
---|---|---|
0x00-0x7F | 0xxxxxxx | 0 |
0x80-0x7FF | 110xxxxx | 1 |
0x800-0xFFFF | 1110xxxx | 2 |
0x10000-0x1FFFFF | 11110xxx | 3 |
0x200000-0x3FFFFFF | 111110xx | 4 |
0x4000000-0x7FFFFFFF | 1111110x | 5 |
0x80000000-0xFFFFFFFFF | 11111110 | 6 |
0x1000000000-0x3FFFFFFFFFF | 11111111 | 7 |
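As a sketch of what an encoder for this extended table might look like (in Python, whose integers are arbitrary-precision; `EXT_TABLE` and `ext_utf8_encode` are names invented here, not an existing API):

```python
# (leading-byte prefix, payload bits in the leading byte, continuation bytes)
EXT_TABLE = [
    (0x00, 7, 0),  # 0xxxxxxx
    (0xC0, 5, 1),  # 110xxxxx
    (0xE0, 4, 2),  # 1110xxxx
    (0xF0, 3, 3),  # 11110xxx
    (0xF8, 2, 4),  # 111110xx
    (0xFC, 1, 5),  # 1111110x
    (0xFE, 0, 6),  # 11111110
    (0xFF, 0, 7),  # 11111111
]

def ext_utf8_encode(n: int) -> bytes:
    """Encode a non-negative integer using the shortest row that fits it."""
    for prefix, lead_bits, cont in EXT_TABLE:
        if n < 1 << (lead_bits + 6 * cont):
            out = [prefix | (n >> (6 * cont))]
            for i in range(cont - 1, -1, -1):
                out.append(0x80 | ((n >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("integer does not fit in 42 bits")

assert ext_utf8_encode(0x20AC) == "€".encode("utf-8")  # agrees with standard UTF-8
assert len(ext_utf8_encode((1 << 42) - 1)) == 8        # largest 42-bit integer
```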
This scheme encodes all 42-bit integers, with the same self-synchronization guarantee that standard UTF-8 has. But the idea behind UTF-8 (leading bytes look different from continuation bytes) doesn't depend on how many continuation bytes each sequence has. Another UTF-8-esque encoding that still preserves self-synchronization is
Range | Byte 1 | # Continue Bytes |
---|---|---|
0x00-0x1FFF | 0xxxxxxx | 1 |
0x2000-0x7FFFFF | 110xxxxx | 3 |
0x800000-0x3FFFFFFFF | 1110xxxx | 5 |
0x400000000-0x1FFFFFFFFFFF | 11110xxx | 7 |
0x200000000000-0xFFFFFFFFFFFFFF | 111110xx | 9 |
0x100000000000000-0x7FFFFFFFFFFFFFFFF | 1111110x | 11 |
0x80000000000000000-0x3FFFFFFFFFFFFFFFFFFF | 11111110 | 13 |
0x40000000000000000000-0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF | 11111111 | 24 |
This scheme encodes all 144-bit integers.
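Since the only structural requirement is that continuation bytes (10xxxxxx) never look like leading bytes, a decoder can be parameterized by how many continuation bytes each leading-byte class demands. Here is a rough sketch for the second table, with invented names and no claim to match any real implementation:

```python
def leading_ones(b: int) -> int:
    """Number of consecutive 1 bits at the top of a byte (0-8)."""
    n = 0
    while n < 8 and b & (0x80 >> n):
        n += 1
    return n

# Continuation-byte count for each leading-byte class in the second table
# (a leading_ones value of 1 means a continuation byte, handled separately).
CONT_FOR_CLASS = {0: 1, 2: 3, 3: 5, 4: 7, 5: 9, 6: 11, 7: 13, 8: 24}

def dialect_decode(data: bytes):
    """Yield decoded integers, resynchronizing after any damaged sequence."""
    i = 0
    while i < len(data):
        ones = leading_ones(data[i])
        if ones == 1:                  # stray continuation byte: skip it
            i += 1
            continue
        cont = CONT_FOR_CLASS[ones]
        value = data[i] & ((1 << max(0, 7 - ones)) - 1)
        i += 1
        complete = True
        for _ in range(cont):
            if i >= len(data) or leading_ones(data[i]) != 1:
                complete = False       # truncated sequence: resync from here
                break
            value = (value << 6) | (data[i] & 0x3F)
            i += 1
        if complete:
            yield value

# 0x05 encoded as 0x00 0x85, and 0x8000 encoded as 0xC0 0x88 0x80 0x80.
assert list(dialect_decode(b"\x00\x85\xc0\x88\x80\x80")) == [0x05, 0x8000]
# Delete the first continuation byte: only the damaged value is lost.
assert list(dialect_decode(b"\x00\xc0\x88\x80\x80")) == [0x8000]
```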
Overlong encodings are still an issue, but if one only ever works with the decoded integers, the redundant encodings can be exploited to carry extra information. For example, Java sometimes encodes the NUL character as `0xC0 0x80`. This avoids potential issues where a bare `0x00` byte is treated as the end of the string. In effect, Java uses an overlong encoding to provide a natural way to escape the NUL terminator.
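Java's actual format (the "modified UTF-8" used by `DataOutputStream.writeUTF` and JNI) also changes how supplementary characters are stored, but the NUL escape on its own is easy to sketch; the helper names below are made up for illustration:

```python
def encode_modified(s: str) -> bytes:
    """Encode like Java's NUL trick: U+0000 becomes the overlong pair 0xC0 0x80."""
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def decode_modified(data: bytes) -> str:
    """Reverse the escape before handing the bytes to a strict decoder."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")

raw = "a\x00b"
encoded = encode_modified(raw)
assert b"\x00" not in encoded          # safe to pass through NUL-terminated C APIs
assert decode_modified(encoded) == raw
```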
UTF-8 sequences are also longer than raw binary, but this is only a problem for memory-constrained systems that must store the sequences, or bandwidth-constrained systems that must transmit a lot of them.
So should you use a custom UTF-8-style encoding? Probably not. You would need to write your own encoder and decoder (or modify existing ones) to use your particular dialect of UTF-8. In some cases you also need to deal with integers larger than 32 bits, which is a lot of pain when writing code with fixed-length integers on 32-bit systems.
I did use this idea in my project creole, which is a very basic bytecode interpreter and assembler.