The WTF-8 encoding

Editor:
Simon Sapin (Mozilla)
Issue tracking:
On GitHub
Change history:
On GitHub
Last updated:
1 September 2017

Abstract

WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.

Table of Contents

  1. Intended audience
  2. Background and motivation
    2.1. Differences with CESU-8
  3. Terminology
    3.1. Surrogate code points
    3.2. Surrogate 16-bit code units
    3.3. Surrogate byte sequences
  4. Potentially ill-formed UTF-16
    4.1. Encoding
    4.2. Decoding
  5. Generalized UTF-8
  6. The WTF-8 encoding
    6.1. Encoding
    6.2. Decoding
    6.3. Converting between WTF-8 and potentially ill-formed UTF-16
    6.4. Converting between WTF-8 and UTF-8
    6.5. Concatenating WTF-8 strings
  7. Implementations
  8. Acknowledgments
  Conformance
  Index
    Terms defined by this specification
  References
    Normative References
    Informative References

1. Intended audience

WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

In particular, the Encoding Standard [ENCODING] defines UTF-8 and other encodings for the Web. There is no and will not be any encoding label [ENCODING] or IANA charset alias [CHARSETS] for WTF-8.

2. Background and motivation

This section is non-normative.

When Unicode 1.0 was published in 1991, it defined 65536 code points from U+0000 to U+FFFF and assigned characters to around half of them. Many software implementations chose the obvious memory representation for Unicode text of 16 bits per code point / character.

At the time, “Unicode” was synonymous with that particular encoding. To disambiguate, that encoding is now called UCS-2.

As subsequent versions of Unicode assigned more characters, it became apparent that 65536 code points would not be sufficient. Unicode was extended to 1114112 code points from U+0000 to U+10FFFF, and the UTF-16 encoding was introduced. This encoding preserves compatibility with existing 16-bit based systems and represents new (supplementary) code points as a pair of “surrogates”.

UTF-16 is designed to represent any Unicode text, but it cannot represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units. UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.

Meanwhile, 16-bit based systems had little to no incentive to do anything about surrogates: For several years, Unicode did not assign any character to supplementary code points, and then (until emoji) only comparatively rare characters. Additionally, the Unicode Standard does not require conforming implementations to maintain well-formedness of UTF-16 strings.

As a result, surrogates do occur in practice and need to be preserved. For example, JavaScript and Windows use UTF-16 internally but do not enforce that surrogates are paired.

We say that strings in these systems are encoded in potentially ill-formed UTF-16 or WTF-16.

Unpaired surrogate 16-bit code units are the only case where an arbitrary sequence of 16-bit code units is ill-formed in UTF-16. UTF-8, however, is more complex and maintaining its well-formedness is arguably more valuable.

This specification defines WTF-8, a superset of UTF-8 that can losslessly represent arbitrary sequences of 16-bit code units (even if ill-formed in UTF-16) but preserves the other well-formedness constraints of UTF-8.

2.1. Differences with CESU-8

Unicode defines a Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). WTF-8 is different from CESU-8.

CESU-8 encodes supplementary code points as surrogate pair byte sequences of six bytes, whereas WTF-8, like UTF-8, encodes them as sequences of four bytes. Therefore, CESU-8 is not a superset of UTF-8.

CESU-8 is also a mapping on UTF-16 code units. Therefore unpaired surrogate byte sequences are ill-formed in CESU-8, whereas supporting them is the entire point of WTF-8.
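To make the size difference concrete, here is a minimal Python sketch that encodes one supplementary code point both ways. The function names are made up for this illustration, and the CESU-8 side is only the pair-of-three-bytes scheme described above, not a complete CESU-8 implementation.

```python
def encode_3_bytes(p):
    """Three-byte form of the UTF-8 bit layout, for values 0x0800..0xFFFF."""
    return bytes([0xE0 | (p >> 12), 0x80 | ((p >> 6) & 0x3F), 0x80 | (p & 0x3F)])


def cesu8_supplementary(p):
    """CESU-8 style: split into a surrogate pair, three bytes per surrogate."""
    lead = ((p - 0x10000) >> 10) + 0xD800
    trail = ((p - 0x10000) & 0x3FF) + 0xDC00
    return encode_3_bytes(lead) + encode_3_bytes(trail)


def wtf8_supplementary(p):
    """WTF-8 and UTF-8 style: one four-byte sequence."""
    return bytes([
        0xF0 | (p >> 18),
        0x80 | ((p >> 12) & 0x3F),
        0x80 | ((p >> 6) & 0x3F),
        0x80 | (p & 0x3F),
    ])
```

For U+1F4A9, the first function yields the six bytes ED A0 BD ED B2 A9, while the second yields the four bytes F0 9F 92 A9, which is also the UTF-8 encoding.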

3. Terminology

These definitions correspond to those of the Glossary of Unicode Terms. [UNICODE]

A Unicode code point is any value in the Unicode codespace; that is, the range of integers from 0 to 1114111. It is noted with a “U+” prefix and four to six hexadecimal digits: the first and last code points are U+0000 and U+10FFFF.

The Basic Multilingual Plane is the range of code points from U+0000 to U+FFFF.

A BMP code point is a code point in the Basic Multilingual Plane.

A supplementary code point is a code point not in the Basic Multilingual Plane. That is, a code point in the range from U+10000 to U+10FFFF.

A Unicode scalar value is a code point that is not a surrogate code point. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+10FFFF.

A BMP scalar value is a Unicode scalar value in the Basic Multilingual Plane. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+FFFF.

Unicode text is a sequence of Unicode scalar values.

UTF-8 is an encoding of Unicode text using 8-bit bytes. Each Unicode scalar value is represented as a sequence of one to four bytes.

UTF-16 is an encoding of Unicode text using 16-bit code units. BMP scalar values are represented as a single 16-bit code unit with the same value. Supplementary code points are represented as a surrogate 16-bit code unit pair.

Note: this specification is only concerned with the UTF-16 encoding form (based on 16-bit code units), and not with the encoding scheme (based on bytes, with UTF-16BE and UTF-16LE variants).

A string is well-formed (not ill-formed) in a given encoding if it follows the specification of that encoding. [UNICODE] defines Well-Formed Code Unit Sequence for UTF-8 and UTF-16.

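In particular, a sequence of 16-bit code units that contains an unpaired surrogate 16-bit code unit is ill-formed in UTF-16, and a sequence of bytes that contains a surrogate byte sequence is ill-formed in UTF-8.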

The replacement character is the code point U+FFFD REPLACEMENT CHARACTER (�). It is used as a substitute to replace ill-formed sub-sequences during a conversion.

A 16-bit code unit is a 16-bit integer used in UTF-16. It is noted with a “0x” prefix and four hexadecimal digits: the first and last 16-bit code units are 0x0000 and 0xFFFF.

Note: The byte serialization or memory representation of a 16-bit code unit (little-endian or big-endian) is out of scope for this specification.

When an algorithm iterates over a sequence (“For every i in …”), consuming the next item means advancing in the sequence such that that item will be skipped during the following iteration of the loop: the item after the next becomes the next item.

The following algorithm prints “1”, “2”, and “4”.

For every digit i in “1234”, run these substeps:

  1. Print i
  2. If i is 2, consume the next digit.
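One way to picture “consuming” is with an explicit iterator. The following Python sketch mirrors the example algorithm; the function name run_example is hypothetical, and it collects the printed items in a list instead of printing them.

```python
def run_example(digits):
    """Return the items the example algorithm would print, in order."""
    printed = []
    it = iter(digits)
    for d in it:
        printed.append(d)
        if d == "2":
            # Consume the next digit: the following iteration will skip it.
            next(it, None)
    return printed
```

Calling run_example("1234") gives "1", "2", and "4": the digit "3" is consumed while "2" is being processed.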

3.1. Surrogate code points

A lead surrogate code point or high surrogate code point is a code point in the range from U+D800 to U+DBFF.

A trail surrogate code point or low surrogate code point is a code point in the range from U+DC00 to U+DFFF.

A surrogate code point is either a lead surrogate code point or a trail surrogate code point. That is, a code point in the range from U+D800 to U+DFFF.

A surrogate code point pair is a sequence of a lead surrogate code point followed by a trail surrogate code point.

An unpaired surrogate code point is a surrogate code point that is not part of a surrogate code point pair.

3.2. Surrogate 16-bit code units

A lead surrogate 16-bit code unit or high surrogate 16-bit code unit is a 16-bit code unit in the range from 0xD800 to 0xDBFF.

A trail surrogate 16-bit code unit or low surrogate 16-bit code unit is a 16-bit code unit in the range from 0xDC00 to 0xDFFF.

A surrogate 16-bit code unit is either a lead surrogate 16-bit code unit or a trail surrogate 16-bit code unit. That is, a 16-bit code unit in the range from 0xD800 to 0xDFFF.

A surrogate 16-bit code unit pair is a sequence of a lead surrogate 16-bit code unit followed by a trail surrogate 16-bit code unit. In UTF-16, it represents a supplementary code point.

An unpaired surrogate 16-bit code unit is a surrogate 16-bit code unit that is not part of a surrogate 16-bit code unit pair.

3.3. Surrogate byte sequences

Note: A surrogate byte sequence (and therefore any byte sequence described in this section) is ill-formed in UTF-8. Decoders are required to treat it as an error.

A lead surrogate byte sequence or high surrogate byte sequence is a sequence of three bytes that represents a lead surrogate code point in generalized UTF-8.

A trail surrogate byte sequence or low surrogate byte sequence is a sequence of three bytes that represents a trail surrogate code point in generalized UTF-8.

A surrogate byte sequence is either a lead surrogate byte sequence or a trail surrogate byte sequence. That is, a sequence of three bytes that represents a surrogate code point in generalized UTF-8.

Table 1. Surrogate byte sequences

Bytes noted in hexadecimal.

                               First byte  Second byte  Third byte
Lead surrogate byte sequence   ED          A0 to AF     80 to BF
Trail surrogate byte sequence  ED          B0 to BF     80 to BF
Surrogate byte sequence        ED          A0 to BF     80 to BF

A surrogate pair byte sequence is a sequence of six bytes composed of a lead surrogate byte sequence followed by a trail surrogate byte sequence.

An unpaired surrogate byte sequence is a surrogate byte sequence that is not part of a surrogate pair byte sequence.
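The byte ranges of Table 1 translate directly into range checks. The following Python sketch is illustrative only; the function names are assumptions, not terms defined by this specification.

```python
def is_lead_surrogate_bytes(b):
    """True if b is exactly three bytes matching the lead surrogate row."""
    return (len(b) == 3 and b[0] == 0xED
            and 0xA0 <= b[1] <= 0xAF and 0x80 <= b[2] <= 0xBF)


def is_trail_surrogate_bytes(b):
    """True if b is exactly three bytes matching the trail surrogate row."""
    return (len(b) == 3 and b[0] == 0xED
            and 0xB0 <= b[1] <= 0xBF and 0x80 <= b[2] <= 0xBF)


def is_surrogate_pair_bytes(b):
    """True if b is six bytes: a lead sequence followed by a trail sequence."""
    return (len(b) == 6
            and is_lead_surrogate_bytes(b[:3])
            and is_trail_surrogate_bytes(b[3:]))
```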

4. Potentially ill-formed UTF-16

A sequence of 16-bit code units is potentially ill-formed UTF-16 if it is intended to be interpreted as UTF-16, but is not necessarily well-formed in UTF-16. It effectively encodes a sequence of code points that does not contain any surrogate code point pair.

Note: Like UTF-16, potentially ill-formed UTF-16 can not represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Unlike well-formed UTF-16, it might contain isolated surrogate code points.

Any sequence of 16-bit code units has an interpretation as potentially ill-formed UTF-16.

WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.

4.1. Encoding

To encode from code points to potentially ill-formed UTF-16, run these steps:

  1. Let result be a sequence of 16-bit code units, initially empty.
  2. For every code point P of the input, run these substeps:
    1. If P is a supplementary code point, append to result two 16-bit code units of values:
      1. ((P - 0x10000) >> 10) + 0xD800
      2. ((P - 0x10000) & 0x3FF) + 0xDC00
    2. Otherwise (P is a BMP code point), append to result a 16-bit code unit of value P.
  3. Return result.

Note: If the input is restricted to Unicode text, this is identical to encoding to UTF-16 and the resulting sequence is well-formed in UTF-16.

If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

This situation should be considered an error, but this specification does not define how to handle it. Possibilities include aborting the conversion, or replacing one of the surrogate code points of the pair with a replacement character.
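The encoding steps can be sketched in Python as follows. This is a minimal illustration, assuming code points and 16-bit code units are represented as plain integers in lists; the name encode_wtf16 is hypothetical. Like the algorithm, it does not detect surrogate code point pairs in the input.

```python
def encode_wtf16(code_points):
    """Encode a list of code points to potentially ill-formed UTF-16."""
    result = []
    for p in code_points:
        if p >= 0x10000:
            # Supplementary code point: emit a surrogate 16-bit code unit pair.
            result.append(((p - 0x10000) >> 10) + 0xD800)
            result.append(((p - 0x10000) & 0x3FF) + 0xDC00)
        else:
            # BMP code point (surrogates included): same value.
            result.append(p)
    return result
```

For instance, U+1F4A9 becomes the pair 0xD83D, 0xDCA9, while a lone U+D800 passes through as 0xD800.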

4.2. Decoding

To decode from potentially ill-formed UTF-16 to code points, run these steps:

  1. Let result be a sequence of code points, initially empty.
  2. For every 16-bit code unit U of the input, run these substeps:
    1. If U is a lead surrogate 16-bit code unit, U is not the last 16-bit code unit of the input, and the next 16-bit code unit of the input next is a trail surrogate 16-bit code unit, then consume next and append to result a code point of value 0x10000 + ((U - 0xD800) << 10) + (next - 0xDC00).
    2. Otherwise, append to result a code point of value U.
  3. Return result.

Note: By construction, the resulting sequence does not contain a surrogate code point pair.

Note: If the input is well-formed in UTF-16, this is identical to decoding UTF-16 and the resulting sequence is Unicode text.
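A minimal Python sketch of the decoding steps, under the same assumption that code units and code points are plain integers in lists (the name decode_wtf16 is hypothetical). An index stands in for “consuming” the next code unit.

```python
def decode_wtf16(units):
    """Decode potentially ill-formed UTF-16 to a list of code points."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Lead followed by trail: combine and consume the trail.
            nxt = units[i + 1]
            result.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            # BMP scalar value or unpaired surrogate: same value.
            result.append(u)
            i += 1
    return result
```

A trail surrogate followed by a lead surrogate is not a pair, so both are kept as isolated surrogate code points.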

5. Generalized UTF-8

For the purpose of this specification, generalized UTF-8 is an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).

Each code point is encoded as a sequence of one to four bytes:

Table 2. Bit distribution

Bytes noted in binary, most significant bit first. x bits represent the least significant bits of the code points.

Code point           First byte  Second byte  Third byte  Fourth byte
U+0000 to U+007F     0xxxxxxx
U+0080 to U+07FF     110xxxxx    10xxxxxx
U+0800 to U+FFFF     1110xxxx    10xxxxxx     10xxxxxx
U+10000 to U+10FFFF  11110xxx    10xxxxxx     10xxxxxx    10xxxxxx

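A byte sequence is well-formed in generalized UTF-8 if and only if it is a concatenation of zero or more byte sequences that each match one row of the following table: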

Table 3. Well-formed byte sequences representing a single code point

Bytes noted in hexadecimal.

Code point            First byte  Second byte  Third byte  Fourth byte
U+0000 to U+007F      00 to 7F
U+0080 to U+07FF      C2 to DF    80 to BF
U+0800 to U+0FFF      E0          A0 to BF     80 to BF
U+1000 to U+FFFF      E1 to EF    80 to BF     80 to BF
U+10000 to U+3FFFF    F0          90 to BF     80 to BF    80 to BF
U+40000 to U+FFFFF    F1 to F3    80 to BF     80 to BF    80 to BF
U+100000 to U+10FFFF  F4          80 to 8F     80 to BF    80 to BF
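Table 3 can be turned into a small table-driven check. The Python sketch below assumes that a byte sequence is well-formed when it splits into consecutive sub-sequences that each match one row of the table; the names ROWS and is_generalized_utf8 are made up for this illustration.

```python
# One entry per row of Table 3: an inclusive (low, high) range for each byte.
ROWS = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def is_generalized_utf8(data):
    """True if data splits into sub-sequences that each match a table row."""
    i = 0
    while i < len(data):
        for row in ROWS:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                    lo <= b <= hi for b, (lo, hi) in zip(chunk, row)):
                i += len(row)
                break
        else:
            # No row matches at position i.
            return False
    return True
```

Because the first-byte ranges of the rows do not overlap, at most one row can match at any position, so no backtracking is needed. Note that surrogate byte sequences such as ED A0 80 are accepted here, unlike in UTF-8.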

6. The WTF-8 encoding

WTF-8 (Wobbly Transformation Format − 8-bit) is an encoding of code point sequences that do not contain any surrogate code point pair using 8-bit bytes.

Note: Just as UTF-8 is artificially restricted to Unicode text in order to match UTF-16, WTF-8 is artificially restricted to exclude surrogate code point pairs in order to match potentially ill-formed UTF-16.

It is identical to generalized UTF-8, with the additional well-formedness constraint that a surrogate pair byte sequence is ill-formed. It is a strict subset of generalized UTF-8 and a strict superset of UTF-8.

Note: Similarly, UTF-8 is a strict superset of ASCII.

WTF-8 must not be used for interchange. See Intended audience.

6.1. Encoding

To encode from code points to well-formed WTF-8, run these steps:

  1. Let result be a sequence of bytes, initially empty.
  2. For every code point P of the input, run these substeps:
    1. If P is a lead surrogate code point, P is not the last code point of the input, and the next code point of the input next is a trail surrogate code point, then consume next and set P’s value to: 0x10000 + ((P - 0xD800) << 10) + (next - 0xDC00).

    2. Depending on P:
      U+0000 to U+007F
      Append to result one byte of value P.
      U+0080 to U+07FF
      Append to result two bytes of values:
      1. 0xC0 | (P >> 6)
      2. 0x80 | (P & 0x3F)
      U+0800 to U+FFFF
      Append to result three bytes of values:
      1. 0xE0 | (P >> 12)
      2. 0x80 | ((P >> 6) & 0x3F)
      3. 0x80 | (P & 0x3F)
      U+10000 to U+10FFFF
      Append to result four bytes of values:
      1. 0xF0 | (P >> 18)
      2. 0x80 | ((P >> 12) & 0x3F)
      3. 0x80 | ((P >> 6) & 0x3F)
      4. 0x80 | (P & 0x3F)
  3. Return result.

Note: If the input contains a surrogate code point pair, the resulting byte sequence will not represent the original sequence of code points. Instead, it will represent the same code points as if the input had been encoded in potentially ill-formed UTF-16. This is also consistent with encoding each code point to WTF-8 individually, and concatenating the resulting WTF-8 byte sequences.
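The encoding steps can be sketched in Python as shown below. This is an illustration under stated assumptions rather than a reference implementation: code points are plain integers, the name encode_wtf8 is hypothetical, and the sketch consumes the trail surrogate after combining a pair so that it is not encoded a second time.

```python
def encode_wtf8(code_points):
    """Encode a list of code points to well-formed WTF-8 bytes."""
    result = bytearray()
    i = 0
    while i < len(code_points):
        p = code_points[i]
        i += 1
        if (0xD800 <= p <= 0xDBFF and i < len(code_points)
                and 0xDC00 <= code_points[i] <= 0xDFFF):
            # Surrogate code point pair: combine, then consume the trail.
            p = 0x10000 + ((p - 0xD800) << 10) + (code_points[i] - 0xDC00)
            i += 1
        if p <= 0x7F:
            result.append(p)
        elif p <= 0x7FF:
            result += bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
        elif p <= 0xFFFF:
            result += bytes([0xE0 | (p >> 12),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
        else:
            result += bytes([0xF0 | (p >> 18),
                             0x80 | ((p >> 12) & 0x3F),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
    return bytes(result)
```

An unpaired U+D83D encodes to the three bytes ED A0 BD, whereas U+D83D followed by U+DCA9 encodes to the same four bytes as U+1F4A9.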

6.2. Decoding

To decode from well-formed WTF-8 to code points, run these steps:

Note: Since WTF-8 must not be used for interchange (see Intended audience), this algorithm is deliberately not defined for arbitrary byte sequences. It is only defined for byte sequences known to be well-formed in WTF-8, such as sequences encoded from code points, converted from UTF-16, or concatenated from sequences themselves well-formed in WTF-8.

  1. Let result be a sequence of code points, initially empty.
  2. For every byte B of the input, depending on B:
    0x00 to 0x7F
    Append to result a code point of value B.
    0xC2 to 0xDF
    Let B2 be the next byte and consume it.

    Append to result a code point of value ((B & 0x1F) << 6) + (B2 & 0x3F)

    0xE0 to 0xEF
    Let B2 and B3 be the next two bytes, and consume them.

    Append to result a code point of value ((B & 0x0F) << 12) + ((B2 & 0x3F) << 6) + (B3 & 0x3F)

    0xF0 to 0xF4
    Let B2, B3, and B4 be the next three bytes, and consume them.

    Append to result a code point of value ((B & 0x07) << 18) + ((B2 & 0x3F) << 12) + ((B3 & 0x3F) << 6) + (B4 & 0x3F)

  3. Return result.

Note: If the input is also well-formed in UTF-8, this is identical to decoding UTF-8 and the resulting sequence is Unicode text.
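A minimal Python sketch of the decoding steps follows. The name decode_wtf8 is hypothetical. The algorithm is only defined for well-formed input, so raising ValueError on an unexpected first byte is an assumption of this sketch, and continuation bytes are not validated.

```python
def decode_wtf8(data):
    """Decode well-formed WTF-8 bytes to a list of code points."""
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            result.append(b)
            i += 1
        elif 0xC2 <= b <= 0xDF:
            b2 = data[i + 1]
            result.append(((b & 0x1F) << 6) + (b2 & 0x3F))
            i += 2
        elif 0xE0 <= b <= 0xEF:
            b2, b3 = data[i + 1], data[i + 2]
            result.append(((b & 0x0F) << 12) + ((b2 & 0x3F) << 6) + (b3 & 0x3F))
            i += 3
        elif 0xF0 <= b <= 0xF4:
            b2, b3, b4 = data[i + 1], data[i + 2], data[i + 3]
            result.append(((b & 0x07) << 18) + ((b2 & 0x3F) << 12)
                          + ((b3 & 0x3F) << 6) + (b4 & 0x3F))
            i += 4
        else:
            # Not reachable for well-formed WTF-8 (assumed error handling).
            raise ValueError("input is not well-formed WTF-8")
    return result
```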

6.3. Converting between WTF-8 and potentially ill-formed UTF-16

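To convert from potentially ill-formed UTF-16 to WTF-8, run these steps:

  1. Decode from potentially ill-formed UTF-16 to code points.
  2. Encode the resulting code points to WTF-8.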

Note: This conversion never fails and is lossless.

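To convert from WTF-8 to potentially ill-formed UTF-16, run these steps:

  1. Decode from WTF-8 to code points.
  2. Encode the resulting code points to potentially ill-formed UTF-16.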

Note: This conversion never fails and, if the input is well-formed in WTF-8, is lossless.
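Both conversions can be sketched by fusing the decoding and encoding algorithms of the earlier sections, so that no intermediate list of code points is built. The Python below is one possible arrangement, assuming well-formed WTF-8 input in the second function; all names are hypothetical.

```python
def _encode_code_point(p):
    """Generalized UTF-8 bytes for a single code point."""
    if p <= 0x7F:
        return bytes([p])
    if p <= 0x7FF:
        return bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
    if p <= 0xFFFF:
        return bytes([0xE0 | (p >> 12), 0x80 | ((p >> 6) & 0x3F),
                      0x80 | (p & 0x3F)])
    return bytes([0xF0 | (p >> 18), 0x80 | ((p >> 12) & 0x3F),
                  0x80 | ((p >> 6) & 0x3F), 0x80 | (p & 0x3F)])


def wtf16_to_wtf8(units):
    """Convert a list of 16-bit code units to WTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        p = units[i]
        i += 1
        if (0xD800 <= p <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            # Paired surrogates become one supplementary code point.
            p = 0x10000 + ((p - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        out += _encode_code_point(p)
    return bytes(out)


def wtf8_to_wtf16(data):
    """Convert well-formed WTF-8 bytes to a list of 16-bit code units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        # Sequence length from the first byte (input assumed well-formed).
        n = 1 if b <= 0x7F else 2 if b <= 0xDF else 3 if b <= 0xEF else 4
        p = b if n == 1 else b & (0xFF >> (n + 1))
        for c in data[i + 1:i + n]:
            p = (p << 6) | (c & 0x3F)
        i += n
        if p >= 0x10000:
            p -= 0x10000
            units += [0xD800 + (p >> 10), 0xDC00 + (p & 0x3FF)]
        else:
            units.append(p)
    return units
```

A round trip through both functions returns the original code units, including unpaired surrogates.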

6.4. Converting between WTF-8 and UTF-8

Since WTF-8 is a superset of UTF-8, any sequence of bytes that is well-formed in UTF-8 is also well-formed in WTF-8 and represents the same text. To convert from UTF-8 to WTF-8, return the input unchanged.

Note: This conversion never fails and is lossless.

To convert lossily from WTF-8 to UTF-8, replace any surrogate byte sequence with the sequence of three bytes <0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character.

Note: Since surrogate byte sequences are also three bytes long, this conversion can be done in place.

Note: This conversion never fails but is lossy.

To convert strictly from WTF-8 to UTF-8, run these steps:

  1. If the input contains a surrogate byte sequence, return failure.
  2. Otherwise, return the input unchanged.

Note: This conversion is lossless when it succeeds, but it can fail.
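The lossy and strict conversions can be sketched with a byte-level regular expression. This Python sketch assumes the input is well-formed WTF-8, in which the byte 0xED only ever appears as the first byte of a three-byte sequence. The function names are hypothetical, and returning None to signal failure is a choice made for this illustration.

```python
import re

# A surrogate byte sequence per Table 1: ED, then A0 to BF, then 80 to BF.
_SURROGATE = re.compile(rb"\xED[\xA0-\xBF][\x80-\xBF]")


def wtf8_to_utf8_lossy(data):
    """Replace every surrogate byte sequence with U+FFFD encoded in UTF-8."""
    return _SURROGATE.sub(b"\xEF\xBF\xBD", data)


def wtf8_to_utf8_strict(data):
    """Return data unchanged, or None if it contains a surrogate byte sequence."""
    if _SURROGATE.search(data):
        return None
    return data
```

Since the replacement is also three bytes long, the lossy output always has the same length as the input.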

6.5. Concatenating WTF-8 strings

Concatenating WTF-8 strings requires extra care to preserve well-formedness.

To concatenate two WTF-8 strings, run these steps:

  1. If the left input string ends with a lead surrogate byte sequence and the right input string starts with a trail surrogate byte sequence, run these substeps:
    1. Let lead and trail be two code points, the respective results of decoding from WTF-8 these two surrogate byte sequences.
    2. Let supplementary be the encoding to WTF-8 of a single code point of value 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00).
    3. Let left be the left input string without its three final bytes.
    4. Let right be the right input string without its three initial bytes.
    5. Return the concatenation of left, supplementary, and right.
  2. Otherwise, return the concatenation of the two input byte sequences.

Note: This is equivalent to converting both strings to potentially ill-formed UTF-16, concatenating the resulting 16-bit code unit sequences, then converting the concatenation back to WTF-8.
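The concatenation steps can be sketched in Python as follows, assuming both inputs are well-formed WTF-8 held in bytes objects. The name concat_wtf8 is hypothetical. Because 0xED can only be the first byte of a three-byte sequence in well-formed input, inspecting the last three bytes of the left string is sufficient.

```python
def concat_wtf8(left, right):
    """Concatenate two well-formed WTF-8 strings, keeping the result well-formed."""
    ends_with_lead = (len(left) >= 3 and left[-3] == 0xED
                      and 0xA0 <= left[-2] <= 0xAF)
    starts_with_trail = (len(right) >= 3 and right[0] == 0xED
                         and 0xB0 <= right[1] <= 0xBF)
    if not (ends_with_lead and starts_with_trail):
        return left + right
    # Decode the two surrogate byte sequences (first byte ED gives 0xD000).
    lead = 0xD000 | ((left[-2] & 0x3F) << 6) | (left[-1] & 0x3F)
    trail = 0xD000 | ((right[1] & 0x3F) << 6) | (right[2] & 0x3F)
    p = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
    supplementary = bytes([0xF0 | (p >> 18),
                           0x80 | ((p >> 12) & 0x3F),
                           0x80 | ((p >> 6) & 0x3F),
                           0x80 | (p & 0x3F)])
    return left[:-3] + supplementary + right[3:]
```

Joining ED A0 BD (U+D83D) with ED B2 A9 (U+DCA9) yields F0 9F 92 A9, the encoding of U+1F4A9, rather than an ill-formed surrogate pair byte sequence.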

7. Implementations

This section is non-normative.

8. Acknowledgments

Thanks to Coralie Mercier for coining the name WTF-8.

Thanks to Anne van Kesteren, David Baron, Dylan Petonke, Henri Sivonen, Jacob Lifshay, James Graham, Kevin Ballard, Mathias Bynens, Ms2ger, Sam Tobin-Hochstadt, and Tab Atkins for feedback and contributions.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Index

Terms defined by this specification

References

Normative References

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[UNICODE]
The Unicode Standard. URL: http://www.unicode.org/versions/latest/

Informative References

[CHARSETS]
Character sets. URL: https://www.iana.org/assignments/character-sets
[ENCODING]
Anne van Kesteren. Encoding Standard. Living Standard. URL: https://encoding.spec.whatwg.org/