Using UTF-8 to Store Integers

2023/02/19

UTF-8 is the encoding used to store Unicode codepoints. Almost every website transmits information in UTF-8. What UTF-8 does, at a low level, is take a 21-bit integer and turn it into a sequence of 8-bit integers, using the following scheme:

Range             Byte 1    Byte 2    Byte 3    Byte 4
0x00-0x7F         0xxxxxxx
0x80-0x7FF        110xxxxx  10xxxxxx
0x800-0xFFFF      1110xxxx  10xxxxxx  10xxxxxx
0x10000-0x1FFFFF  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

(The Unicode standard forbids encoding some of these integers, but that’s not important right now.)
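To make the scheme concrete, here is a minimal encoder sketch in C following the table above (the name utf8_encode and the buffer convention are my own):

    #include <stdint.h>
    #include <stddef.h>

    /* Encode one codepoint (up to 21 bits) into buf, which must hold at
       least 4 bytes. Returns the number of bytes written, 0 if out of range. */
    size_t utf8_encode(uint32_t cp, uint8_t *buf) {
        if (cp <= 0x7F) {
            buf[0] = (uint8_t)cp;                         /* 0xxxxxxx */
            return 1;
        } else if (cp <= 0x7FF) {
            buf[0] = 0xC0 | (uint8_t)(cp >> 6);           /* 110xxxxx */
            buf[1] = 0x80 | (uint8_t)(cp & 0x3F);         /* 10xxxxxx */
            return 2;
        } else if (cp <= 0xFFFF) {
            buf[0] = 0xE0 | (uint8_t)(cp >> 12);          /* 1110xxxx */
            buf[1] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
            buf[2] = 0x80 | (uint8_t)(cp & 0x3F);
            return 3;
        } else if (cp <= 0x1FFFFF) {
            buf[0] = 0xF0 | (uint8_t)(cp >> 18);          /* 11110xxx */
            buf[1] = 0x80 | (uint8_t)((cp >> 12) & 0x3F);
            buf[2] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
            buf[3] = 0x80 | (uint8_t)(cp & 0x3F);
            return 4;
        }
        return 0;
    }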

Why use UTF-8? There are a number of text-encoding reasons to prefer UTF-8 (backwards compatibility with ASCII, for one), but the reason I care about is self-synchronization: even if bytes are corrupted or deleted, a decoder can still find the start of the next sequence and decode every valid, untouched UTF-8 sequence.
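This works because a continue byte always starts with the bits 10, while a leading byte never does, so a decoder that lands in the middle of a sequence can just skip forward to the next leading byte. A sketch, with helper names of my own choosing:

    #include <stdint.h>
    #include <stddef.h>

    /* A continue byte always has the form 10xxxxxx. */
    static int is_continue_byte(uint8_t b) {
        return (b & 0xC0) == 0x80;
    }

    /* After hitting a malformed or truncated sequence at position i, skip
       forward until a byte that can start a sequence. Everything from that
       point on decodes normally. */
    static size_t resync(const uint8_t *buf, size_t len, size_t i) {
        while (i < len && is_continue_byte(buf[i]))
            i++;
        return i;
    }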

The only peculiarity of UTF-8 decoding is overlong encodings: distinct byte sequences that decode to the same codepoint (0xC0 0x80 and 0x00 both encode the same Unicode character). This can be a security issue if UTF-8 text is compared byte by byte without decoding the sequences first. The uniqueness issue is easily fixed by specifying that the only correct encoding is the shortest sequence.
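In code, the shortest-form rule amounts to checking the decoded value against the smallest value each sequence length can legitimately hold (a sketch; the names are mine):

    #include <stdint.h>
    #include <stddef.h>

    /* Smallest codepoint that genuinely needs each sequence length,
       taken from the first table above (index 0 is unused). */
    static const uint32_t min_for_length[5] = { 0, 0x00, 0x80, 0x800, 0x10000 };

    /* A decoded value is valid only if it could not have been encoded
       in fewer bytes. */
    int is_shortest_form(uint32_t cp, size_t nbytes) {
        return nbytes >= 1 && nbytes <= 4 && cp >= min_for_length[nbytes];
    }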

There is nothing inherent to the UTF-8 encoding scheme that forbids further extension:

Range                       Byte 1    # Continue Bytes
0x00-0x7F                   0xxxxxxx  0
0x80-0x7FF                  110xxxxx  1
0x800-0xFFFF                1110xxxx  2
0x10000-0x1FFFFF            11110xxx  3
0x200000-0x3FFFFFF          111110xx  4
0x4000000-0x7FFFFFFF        1111110x  5
0x80000000-0xFFFFFFFFF      11111110  6
0x1000000000-0x3FFFFFFFFFF  11111111  7
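An encoder for this extended scheme is a straightforward generalization of the standard one; here is a table-driven sketch (the name utf8x_encode is hypothetical):

    #include <stdint.h>
    #include <stddef.h>

    /* Encode a value of up to 42 bits. buf must hold at least 8 bytes.
       Returns the number of bytes written, 0 if the value is too large. */
    size_t utf8x_encode(uint64_t v, uint8_t *buf) {
        /* Upper bound of each row above, indexed by continue-byte count. */
        static const uint64_t row_max[8] = {
            0x7F, 0x7FF, 0xFFFF, 0x1FFFFF,
            0x3FFFFFF, 0x7FFFFFFF, 0xFFFFFFFFFULL, 0x3FFFFFFFFFFULL
        };
        /* Lead-byte prefix for each row. */
        static const uint8_t lead[8] = {
            0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF
        };
        if (v > row_max[7])
            return 0;
        size_t cont = 0;
        while (v > row_max[cont])
            cont++;
        /* High bits go in the lead byte, 6 bits in each continue byte. */
        buf[0] = lead[cont] | (uint8_t)(v >> (6 * cont));
        for (size_t i = 1; i <= cont; i++)
            buf[i] = 0x80 | (uint8_t)((v >> (6 * (cont - i))) & 0x3F);
        return cont + 1;
    }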

This scheme encodes all 42-bit integers, with the same guarantee of self-synchronization that standard UTF-8 has. But the idea behind UTF-8 (leading bytes are distinguishable from continue bytes) doesn’t depend on the number of continue bytes each sequence has. Another UTF-8-esque encoding that still preserves self-synchronization is:

Range                                       Byte 1    # Continue Bytes
0x00-0x1FFF                                 0xxxxxxx  1
0x2000-0x7FFFFF                             110xxxxx  3
0x800000-0x3FFFFFFFF                        1110xxxx  5
0x400000000-0x1FFFFFFFFFFF                  11110xxx  7
0x200000000000-0xFFFFFFFFFFFFFF             111110xx  9
0x100000000000000-0x7FFFFFFFFFFFFFFFF       1111110x  11
0x80000000000000000-0x3FFFFFFFFFFFFFFFFFFF  11111110  13
0x40000000000000000000-0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  11111111  24

This scheme encodes all 144-bit integers.
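The totals fall straight out of the arithmetic: the widest encodable integer is determined by the last row, which contributes the lead byte’s payload bits plus 6 bits per continue byte. A quick check:

    #include <stdio.h>

    /* Payload bits carried by each possible lead byte, row by row. */
    static const int lead_bits[8] = { 7, 5, 4, 3, 2, 1, 0, 0 };

    int main(void) {
        /* Continue-byte counts of the last row of each scheme above. */
        int extended_last  = 7;   /* 11111111 + 7 continue bytes  */
        int alternate_last = 24;  /* 11111111 + 24 continue bytes */

        printf("extended:  %d bits\n", lead_bits[7] + 6 * extended_last);  /* 42 */
        printf("alternate: %d bits\n", lead_bits[7] + 6 * alternate_last); /* 144 */
        return 0;
    }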

Overlong encodings are still an issue, but if one only works with decoded integers, the redundant encodings can be exploited to carry additional information. For example, Java sometimes encodes the NUL character as 0xC0 0x80. This avoids potential issues where a bare 0x00 byte would be treated as the end of the string. In effect, Java uses an overlong encoding to provide a natural way to escape the NUL terminator.
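A sketch of that trick, reusing the utf8_encode helper from earlier (the names are mine, and this mirrors Java’s modified UTF-8 rather than implementing it exactly):

    #include <stdint.h>
    #include <stddef.h>

    /* Assumes the utf8_encode sketch from earlier in this post. */
    size_t utf8_encode(uint32_t cp, uint8_t *buf);

    /* Emit U+0000 as the overlong pair 0xC0 0x80 so the output never
       contains a bare 0x00 byte; everything else encodes normally. */
    size_t write_escaped(uint32_t cp, uint8_t *buf) {
        if (cp == 0) {
            buf[0] = 0xC0;
            buf[1] = 0x80;
            return 2;
        }
        return utf8_encode(cp, buf);
    }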

UTF-8-style sequences are also longer than raw binary: each continue byte carries only 6 of its 8 bits, so long sequences have roughly 33% overhead. But this is only a problem for memory-constrained systems that must store many sequences, or bandwidth-constrained systems that must transmit a lot of them.

So should you use a special UTF-8 encoding? Probably not. You will need to write your own encoder (or modify an existing one) in order to use your specifically designed dialect of UTF-8. And in some cases you will need to deal with integers larger than 32 bits, which causes a lot of pain when writing code with fixed-length integers on 32-bit systems.

I did use this idea in my project creole, which is a very basic bytecode interpreter and assembler.