UTF-8 is the encoding used to store Unicode codepoints, and almost every website transmits its text in UTF-8. At a low level, UTF-8 takes a 21-bit integer (a codepoint) and turns it into a sequence of 8-bit integers (bytes) using the following scheme:
Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|
0x00-0x7F | 0xxxxxxx | |||
0x80-0x7FF | 110xxxxx | 10xxxxxx | ||
0x800-0xFFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
0x10000-0x1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
(There are some integers that are not allowed to be encoded by the Unicode standard, but that’s not important right now.)
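As a sketch of the table, here is a minimal single-codepoint encoder in Python; the function name `utf8_encode` is mine, and the validity checks hinted at above (surrogates, the 0x10FFFF cap) are deliberately omitted.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single codepoint per the table above (validity checks omitted)."""
    if cp <= 0x7F:
        return bytes([cp])                          # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode(0x20AC) == "€".encode("utf-8")   # 0xE2 0x82 0xAC
assert utf8_encode(0x1F600) == "😀".encode("utf-8")  # 0xF0 0x9F 0x98 0x80
```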
Why use UTF-8? There are a number of text-encoding reasons to prefer it (backwards compatibility with ASCII, for one), but the reason I care about is self-synchronization: even if bytes are deleted from the stream, a decoder can still find and decode every valid, untouched UTF-8 sequence, because continuation bytes can never be mistaken for leading bytes.
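A quick way to see self-synchronization in action is Python's built-in decoder: delete a byte from the middle of a valid stream, and only the damaged sequence is lost, not everything after it.

```python
data = "héllo".encode("utf-8")     # b'h\xc3\xa9llo'
damaged = data[:1] + data[2:]      # drop the leading byte 0xC3
# The orphaned continuation byte 0xA9 is rejected, but the decoder
# resynchronizes immediately afterwards and recovers "llo".
print(damaged.decode("utf-8", errors="replace"))   # 'h\ufffdllo'
```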
The only real peculiarity of UTF-8 decoding is overlong encodings: distinct byte sequences that decode to the same codepoint (`0xC0 0x80` and `0x00` both encode U+0000). This can be a security issue if UTF-8 text is compared byte by byte without decoding the sequences first. The ambiguity is easily fixed by specifying that the only valid encoding is the shortest one.
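Strict decoders already enforce the shortest-sequence rule; Python's, for instance, rejects the overlong pair outright:

```python
# 0xC0 0x80 is an overlong (two-byte) encoding of U+0000.
try:
    b"\xc0\x80".decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)    # strict decoders refuse overlong forms

# The canonical (shortest) encoding of U+0000 is the single byte 0x00.
assert "\x00".encode("utf-8") == b"\x00"
```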
There is nothing inherent to the UTF-8 encoding scheme that forbids further extension:
Range | Byte 1 | # Continue Bytes |
---|---|---|
0x00-0x7F | 0xxxxxxx | 0 |
0x80-0x7FF | 110xxxxx | 1 |
0x800-0xFFFF | 1110xxxx | 2 |
0x10000-0x1FFFFF | 11110xxx | 3 |
0x200000-0x3FFFFFF | 111110xx | 4 |
0x4000000-0x7FFFFFFF | 1111110x | 5 |
0x80000000-0xFFFFFFFFF | 11111110 | 6 |
0x1000000000-0x3FFFFFFFFFF | 11111111 | 7 |
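As a sketch of what an encoder for this extended table might look like (in Python, whose integers are arbitrary-precision; `EXT_TABLE` and `ext_utf8_encode` are names invented here, not an existing API):

```python
# (leading-byte prefix, payload bits in the leading byte, continuation bytes)
EXT_TABLE = [
    (0x00, 7, 0),  # 0xxxxxxx
    (0xC0, 5, 1),  # 110xxxxx
    (0xE0, 4, 2),  # 1110xxxx
    (0xF0, 3, 3),  # 11110xxx
    (0xF8, 2, 4),  # 111110xx
    (0xFC, 1, 5),  # 1111110x
    (0xFE, 0, 6),  # 11111110
    (0xFF, 0, 7),  # 11111111
]

def ext_utf8_encode(n: int) -> bytes:
    """Encode a non-negative integer using the shortest row that fits it."""
    for prefix, lead_bits, cont in EXT_TABLE:
        if n < 1 << (lead_bits + 6 * cont):
            out = [prefix | (n >> (6 * cont))]
            for i in range(cont - 1, -1, -1):
                out.append(0x80 | ((n >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("integer does not fit in 42 bits")

assert ext_utf8_encode(0x20AC) == "€".encode("utf-8")  # agrees with standard UTF-8
assert len(ext_utf8_encode((1 << 42) - 1)) == 8        # largest 42-bit integer
```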
This scheme encodes all 42-bit integers, with the same self-synchronization guarantee that standard UTF-8 has. But the idea behind UTF-8 (leading bytes look different from continuation bytes) doesn't depend on how many continuation bytes each sequence has. Another UTF-8-esque encoding that still preserves self-synchronization is
Range | Byte 1 | # Continue Bytes |
---|---|---|
0x00-0x1FFF | 0xxxxxxx | 1 |
0x2000-0x7FFFFF | 110xxxxx | 3 |
0x800000-0x3FFFFFFFF | 1110xxxx | 5 |
0x400000000-0x1FFFFFFFFFFF | 11110xxx | 7 |
0x200000000000-0xFFFFFFFFFFFFFF | 111110xx | 9 |
0x100000000000000-0x7FFFFFFFFFFFFFFFF | 1111110x | 11 |
0x80000000000000000-0x3FFFFFFFFFFFFFFFFFFF | 11111110 | 13 |
0x40000000000000000000-0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF | 11111111 | 24 |
This scheme encodes all 144-bit integers.
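Since the only structural requirement is that continuation bytes (10xxxxxx) never look like leading bytes, a decoder can be parameterized by how many continuation bytes each leading-byte class demands. Here is a rough sketch for the second table, with invented names and no claim to match any real implementation:

```python
def leading_ones(b: int) -> int:
    """Number of consecutive 1 bits at the top of a byte (0-8)."""
    n = 0
    while n < 8 and b & (0x80 >> n):
        n += 1
    return n

# Continuation-byte count for each leading-byte class in the second table
# (a leading_ones value of 1 means a continuation byte, handled separately).
CONT_FOR_CLASS = {0: 1, 2: 3, 3: 5, 4: 7, 5: 9, 6: 11, 7: 13, 8: 24}

def dialect_decode(data: bytes):
    """Yield decoded integers, resynchronizing after any damaged sequence."""
    i = 0
    while i < len(data):
        ones = leading_ones(data[i])
        if ones == 1:                  # stray continuation byte: skip it
            i += 1
            continue
        cont = CONT_FOR_CLASS[ones]
        value = data[i] & ((1 << max(0, 7 - ones)) - 1)
        i += 1
        complete = True
        for _ in range(cont):
            if i >= len(data) or leading_ones(data[i]) != 1:
                complete = False       # truncated sequence: resync from here
                break
            value = (value << 6) | (data[i] & 0x3F)
            i += 1
        if complete:
            yield value

# 0x05 encoded as 0x00 0x85, and 0x8000 encoded as 0xC0 0x88 0x80 0x80.
assert list(dialect_decode(b"\x00\x85\xc0\x88\x80\x80")) == [0x05, 0x8000]
# Delete the first continuation byte: only the damaged value is lost.
assert list(dialect_decode(b"\x00\xc0\x88\x80\x80")) == [0x8000]
```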
Overlong encodings are still an issue, but if one only ever works with the decoded integers, the redundant encodings can be exploited to carry extra information. For example, Java sometimes encodes the NUL character as `0xC0 0x80`. This avoids potential issues where a bare `0x00` byte is treated as the end of the string. In effect, Java uses an overlong encoding to provide a natural way to escape the NUL terminator.
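Java's actual format (the "modified UTF-8" used by `DataOutputStream.writeUTF` and JNI) also changes how supplementary characters are stored, but the NUL escape on its own is easy to sketch; the helper names below are made up for illustration:

```python
def encode_modified(s: str) -> bytes:
    """Encode like Java's NUL trick: U+0000 becomes the overlong pair 0xC0 0x80."""
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def decode_modified(data: bytes) -> str:
    """Reverse the escape before handing the bytes to a strict decoder."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")

raw = "a\x00b"
encoded = encode_modified(raw)
assert b"\x00" not in encoded          # safe to pass through NUL-terminated C APIs
assert decode_modified(encoded) == raw
```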
UTF-8 sequences are also longer than raw binary, but this is only a problem for memory-constrained systems that must store the sequences, or bandwidth-constrained systems that must transmit a lot of them.
So should you use a custom UTF-8-style encoding? Probably not. You would need to write your own encoder and decoder (or modify existing ones) to use your particular dialect of UTF-8. In some cases you also need to deal with integers larger than 32 bits, which is a lot of pain when writing code with fixed-length integers on 32-bit systems.
I did use this idea in my project creole, which is a very basic bytecode interpreter and assembler.