Back on December 13th, I posted a challenge on Mastodon: In a simple UTF-8 byte-driven finite automaton, how many states does it take to match the regular-expression construct “.”, i.e. “any character”?
Commenter Anthony Williams responded, getting it almost right I think, but I found his description a little hard to understand. In this piece I’m going to dig into what “.” actually means, and then how many states you need to match it.
[Update: Lots more on this subject and some of the material below is arguably wrong, but just “arguably”; see
Dot-matching Redux.]
The answer surprised me. Obviously this is of interest only to the faction of people who are interested in automaton wrangling, problematic characters, and the finer points of UTF-8. I expect close attention from all 17 of you!
The answer is… · Four. Or five, depending.
What’s a “Unicode character”? · They’re represented by “code points”, which are numbers in the range 0 … 17×2¹⁶, which is to say 1,114,112 possible values. It turns out you don’t actually want to match all of them; more on that later.
How many states? ·
Quamina is a “byte-level automaton”, which means it’s in a state, it reads a byte, and it looks up the value of that byte in a map yielding either the next state or nil, which means no match. Repeat until you match or fail.
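That loop can be sketched in a few lines of Go. This is purely illustrative (hypothetical names, not Quamina’s actual types or API): a state is just a 256-entry table of successor states, with nil meaning failure.

```go
package main

import "fmt"

// A minimal sketch of a byte-level automaton: each state maps a byte value
// to the next state; a nil entry means "no match".
type state struct {
	next [256]*state
}

// match runs the automaton over input, one byte at a time, and reports
// whether every byte had a transition.
func match(start *state, input []byte) bool {
	s := start
	for _, b := range input {
		if s = s.next[b]; s == nil {
			return false
		}
	}
	return true
}

// asciiStart builds a toy two-state automaton that matches one ASCII byte.
func asciiStart() *state {
	matched := &state{}
	start := &state{}
	for b := 0; b <= 0x7F; b++ {
		start.next[b] = matched
	}
	return start
}

func main() {
	fmt.Println(match(asciiStart(), []byte("a")))  // true
	fmt.Println(match(asciiStart(), []byte{0xC3})) // false: no transition from Start
}
```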
What bytes are we talking about here? We’re talking about UTF-8 bytes. If you don’t understand UTF-8 the rest of this is going to be difficult. I wrote a short explainer called Characters vs. Bytes twenty-one years ago. I now assume you understand UTF-8 and know that code points are encoded as sequences of one to four bytes.
Let’s count!
1. When you match a code point successfully you move to the part of the automaton that’s trying to match the next one; let’s call this condition MATCHED.
(From here on, all the numbers are hex; I’ll skip the leading 0x. And all the ranges are inclusive.)
2. In multi-byte characters, all the UTF-8 bytes but the first have the bit pattern 10XX XXXX, so there are six significant bits, thus 2⁶ or 64 distinct possible values, ranging from 80-BF.
3. There’s a Start state. It maps byte values 00-7F (as in ASCII) to MATCHED. That’s our first state, and we’ve handled all the one-byte code points.
4. In the Start state, the 32 byte values C0-DF, all of which begin 110, signaling a two-byte code point, are mapped to the Last state. In the Last state, the 64 values 80-BF are mapped to MATCHED. This takes care of all the two-byte code points and we’re up to two states.
5. In the Start state, the 16 byte values E0-EF, all of which begin 1110, signaling a three-byte code point, are mapped to the LastInter state. In that state, the 64 values 80-BF are mapped to the Last state. Now we’re up to three states and we’ve handled the three-byte code points.
6. In the Start state, the 8 byte values F0-F7, all of which begin 11110, signaling a four-byte code point, are mapped to the FirstInter state. In that state, the 64 values 80-BF are mapped to the LastInter state. Now we’ve handled all the code points with four states.
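Those steps can be wired up directly. The sketch below (hypothetical names again, not Quamina’s code) builds Start, FirstInter, LastInter, and Last plus a MATCHED sentinel, then checks a few encodings against it.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// An illustrative four-state "." automaton: Start, FirstInter, LastInter,
// Last, plus a MATCHED sentinel state.
type state struct{ next [256]*state }

// fill maps every byte value in [lo, hi] to the state `to`.
func fill(s *state, lo, hi byte, to *state) {
	for b := int(lo); b <= int(hi); b++ {
		s.next[b] = to
	}
}

// dotStart builds the automaton exactly as in the numbered steps; note it
// naively accepts surrogates and the never-valid lead bytes C0, C1, F5-F7.
func dotStart() (start, matched *state) {
	matched = &state{}
	last := &state{}
	lastInter := &state{}
	firstInter := &state{}
	start = &state{}

	fill(last, 0x80, 0xBF, matched)         // final continuation byte
	fill(lastInter, 0x80, 0xBF, last)       // middle continuation byte
	fill(firstInter, 0x80, 0xBF, lastInter) // first continuation byte

	fill(start, 0x00, 0x7F, matched)    // one-byte (ASCII) code points
	fill(start, 0xC0, 0xDF, last)       // two-byte lead
	fill(start, 0xE0, 0xEF, lastInter)  // three-byte lead
	fill(start, 0xF0, 0xF7, firstInter) // four-byte lead
	return start, matched
}

// matchesDot reports whether b is exactly one UTF-8-encoded code point
// according to the automaton.
func matchesDot(b []byte) bool {
	s, matched := dotStart()
	for _, c := range b {
		if s = s.next[c]; s == nil {
			return false
		}
	}
	return s == matched
}

func main() {
	// One-, two-, three-, and four-byte code points all land on MATCHED.
	for _, r := range []rune{'a', 0xE9, 0x20AC, 0x1D11E} {
		buf := make([]byte, 4)
		n := utf8.EncodeRune(buf, r)
		fmt.Println(matchesDot(buf[:n])) // true, four times
	}
	fmt.Println(matchesDot([]byte("ab"))) // false: that's two code points
}
```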
But wait! · I mentioned above that you don’t actually want to match all the code points. “Wait,” you say, “why wouldn’t you want to be maximally inclusive?!” Once again, I’ll link to Unicode Character Repertoire Subsets, a document I co-wrote that is making its way through the IETF and may become an RFC some year. I’m not going to try to summarize a draft that bends over backwards to be short and clear; suffice it to say that there are good reasons for leaving out several different flavors of code point.
Probably the most pernicious code points are the “Surrogates”, U+D800-U+DFFF. If you want an explanation of what they are and why they’re bad, go read that Repertoire Subsets draft or just take my word for it. If you were to encode them per UTF-8 rules (which the UTF-8 spec says you’re not allowed to), the low and high bounds would be ED,A0,80 and ED,BF,BF.
Go’s UTF-8 implementation agrees that Surrogates Are Bad and The UTF-8 Spec Is Good and flatly refuses to convert those UTF-8 sequences into code points or vice versa. The resulting subset of code points even has a catchy name: Unicode Scalars. Case closed, right?
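You can watch Go’s unicode/utf8 package enforce this. A quick check (the helper names here are mine, for illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// surrogateRuneValid reports whether Go treats U+D800 as a valid rune.
func surrogateRuneValid() bool { return utf8.ValidRune(0xD800) }

// surrogateBytesValid reports whether Go accepts the forbidden UTF-8-style
// encoding of U+D800, the bytes ED A0 80.
func surrogateBytesValid() bool { return utf8.Valid([]byte{0xED, 0xA0, 0x80}) }

func main() {
	fmt.Println(surrogateRuneValid())  // false: not a Unicode scalar
	fmt.Println(surrogateBytesValid()) // false: the decoder rejects it too

	// Asking the encoder for a surrogate gets you U+FFFD instead.
	buf := make([]byte, 4)
	n := utf8.EncodeRune(buf, 0xD800)
	fmt.Printf("%X\n", buf[:n]) // EFBFBD, the replacement character
}
```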
Wrong. JSON was designed before we’d thought through these problems, and it explicitly says it’s OK to include any code point whatsoever, including surrogates. And Quamina is used for matching JSON data. So, standards fight!
I’m being a little unfair here. I’m sure that if Doug Crockford were inventing JSON now instead of in 2001, he’d exclude surrogates and probably some of the other problematic code points discussed in that Subsets doc.
Anyhow, Quamina will go with Go and exclude surrogates. Any RFC8259 purists out there, feel free to accuse me of standards apostasy and I will grant your point but won’t change Quamina. Actually, not true; at some point I’ll probably add an option to be more restrictive and exclude more than just surrogates.
Which means that now we have to go back to the start of this essay and figure out how many states it takes to match “.” Let’s see…
The Start state changes a bit. See #5 in the list above. Instead of mapping all of E0-EF to the LastInter state, it maps one byte in that range, ED, to a new state we’ll call, let’s see, how about ED.
In ED, just as in LastInter, 80-9F are mapped to Last. But A0-BF aren’t mapped to anything, because on that path lie the surrogates.
So, going with the Unicode Scalar path of virtue means I need five states, not four.
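The fifth state amounts to one override in the Start table plus a new ED state whose continuation range stops at 9F. Another illustrative sketch, under the same hypothetical-names caveat as before:

```go
package main

import "fmt"

// The five-state variant: identical to the four-state "." automaton except
// that the lead byte ED routes to its own state, which refuses A0-BF and so
// cuts out the surrogate encodings ED A0 80 through ED BF BF.
type state struct{ next [256]*state }

func fill(s *state, lo, hi byte, to *state) {
	for b := int(lo); b <= int(hi); b++ {
		s.next[b] = to
	}
}

func scalarDotStart() (start, matched *state) {
	matched = &state{}
	last := &state{}
	lastInter := &state{}
	firstInter := &state{}
	ed := &state{} // the fifth state
	start = &state{}

	fill(last, 0x80, 0xBF, matched)
	fill(lastInter, 0x80, 0xBF, last)
	fill(firstInter, 0x80, 0xBF, lastInter)
	fill(ed, 0x80, 0x9F, last) // A0-BF left unmapped: those are surrogates

	fill(start, 0x00, 0x7F, matched)
	fill(start, 0xC0, 0xDF, last)
	fill(start, 0xE0, 0xEF, lastInter)
	start.next[0xED] = ed // the one change to Start: ED is special
	fill(start, 0xF0, 0xF7, firstInter)
	return start, matched
}

func matchesScalarDot(b []byte) bool {
	s, matched := scalarDotStart()
	for _, c := range b {
		if s = s.next[c]; s == nil {
			return false
		}
	}
	return s == matched
}

func main() {
	fmt.Println(matchesScalarDot([]byte{0xED, 0x9F, 0xBF})) // U+D7FF: true
	fmt.Println(matchesScalarDot([]byte{0xED, 0xA0, 0x80})) // U+D800: false
	fmt.Println(matchesScalarDot([]byte{0xEE, 0x80, 0x80})) // U+E000: true
}
```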
Comment feed for ongoing:
From: Ed Davies (Dec 21 2024, at 03:21)
Shouldn't UTF-8 byte sequences which “validly” represent codepoints but with excess bytes be excluded?
E.g., 0xC1 0xA1 encodes LATIN SMALL LETTER A as a two byte sequence when it should be encoded with just one byte. This is a known way to sneak characters past security checks.
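[Ed’s example is easy to check against Go’s decoder; a quick illustrative snippet, not part of Quamina:]

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// overlongValid reports whether Go accepts C1 A1, an overlong two-byte
// encoding of U+0061 ('a'); well-formed UTF-8 forbids it, which is why
// C0 and C1 can never appear as lead bytes.
func overlongValid() bool { return utf8.Valid([]byte{0xC1, 0xA1}) }

func main() {
	fmt.Println(overlongValid()) // false: Go rejects overlong encodings
	r, size := utf8.DecodeRune([]byte{0xC1, 0xA1})
	fmt.Println(r == utf8.RuneError, size) // true 1
}
```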
From: Tim (Dec 21 2024, at 09:36)
Hey Ed, yes, the general term for that whole problem is “normalization”, there are various “normal forms”. It’s quite complex. See https://en.wikipedia.org/wiki/Unicode_equivalence - At the moment Quamina works on code points rather than any higher-level abstraction. However, it's already got case-folding - very similar problem - and there's nothing in the design that could prevent having a pattern that required normalization at match time.
Having said that, Quamina is mostly used at a low level for event dispatching. A lot of times the key/value you match on is going to be used as a database key and for that kind of thing you want to stay away from normalization. So it wouldn't be a good idea to normalize by default.
From: Robert Sayre (Dec 21 2024, at 12:31)
Well, this is why the part on escape sequences is in your draft, right?
https://go.dev/play/p/-H9LU4g9OEh
Go takes one of your recommendations for dealing with these in your draft, but I'm pretty sure it could be hacked to round-trip that monstrosity.
I was horrified to learn that a few implementations do encode these directly in invalid UTF-8, but I think most UTF-8 encoders and decoders would reject these, and no one does this.
So, is it ok for the state machine here for "." to allow the escaped surrogate here?
From: Ian Z (Dec 21 2024, at 12:45)
Even disregarding surrogates, it is highly debatable if we can sensibly identify "code point" and "character". There's all that combining stuff especially, it seems, in South Asian scripts. See for example the discussion here:
https://discuss.ocaml.org/t/literals-for-uchar-t-unicode-code-points-more-precisely-unicode-scalar-values/13111
and here:
https://discuss.ocaml.org/t/ocaml-standard-library-unicode-support/14520
From: Ed Davies (Dec 23 2024, at 08:17)
Hi Tim,
No, I'm not talking about normalization, which is indeed a higher level issue, but rather the well-formedness of the UTF-8 encoding itself.
Looking at Table 3-7 in the Unicode standard (https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27506) they give the examples of the byte values C0 and C1 which are never valid in well-formed UTF-8 but which you accept.
I think to properly match only well-formed UTF-8 a few more states are needed. As well as your Start, Matched, Last and Last Intermediate states, there need to be separate states, for each of the rows of Table 3-7 except the first, to match the second byte, though the states where the first byte is E1..EC and EE..EF can be shared.
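[The rows of Table 3-7 that Ed is describing can equivalently be written as range checks, one clause per row. The sketch below, an editorial illustration rather than Quamina code, cross-checks them against Go’s own decoder:]

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// wellFormedRune reports whether b is exactly one rune of well-formed UTF-8,
// following the rows of Table 3-7 in the Unicode core spec.
func wellFormedRune(b []byte) bool {
	in := func(x, lo, hi byte) bool { return x >= lo && x <= hi }
	cont := func(x byte) bool { return in(x, 0x80, 0xBF) }
	switch len(b) {
	case 1:
		return b[0] <= 0x7F
	case 2:
		return in(b[0], 0xC2, 0xDF) && cont(b[1]) // C0, C1 excluded
	case 3:
		switch {
		case b[0] == 0xE0:
			return in(b[1], 0xA0, 0xBF) && cont(b[2]) // no overlongs
		case in(b[0], 0xE1, 0xEC) || in(b[0], 0xEE, 0xEF):
			return cont(b[1]) && cont(b[2]) // these rows share a state
		case b[0] == 0xED:
			return in(b[1], 0x80, 0x9F) && cont(b[2]) // no surrogates
		}
		return false
	case 4:
		switch {
		case b[0] == 0xF0:
			return in(b[1], 0x90, 0xBF) && cont(b[2]) && cont(b[3]) // no overlongs
		case in(b[0], 0xF1, 0xF3):
			return cont(b[1]) && cont(b[2]) && cont(b[3])
		case b[0] == 0xF4:
			return in(b[1], 0x80, 0x8F) && cont(b[2]) && cont(b[3]) // <= U+10FFFF
		}
		return false
	}
	return false
}

func main() {
	// Cross-check against Go's decoder for every 1- and 2-byte sequence.
	ok := true
	for i := 0; i < 256; i++ {
		for j := -1; j < 256; j++ {
			b := []byte{byte(i)}
			if j >= 0 {
				b = append(b, byte(j))
			}
			if wellFormedRune(b) != (utf8.Valid(b) && utf8.RuneCount(b) == 1) {
				ok = false
			}
		}
	}
	fmt.Println(ok) // true: the table agrees with unicode/utf8
}
```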
From: Robert Sayre (Dec 23 2024, at 09:04)
Oh, I meant to mention this more explicitly: As an RFC8259 purist, I have never interpreted it to allow invalid UTF-8, and I don't think Crockford did either (I was on the ES5 committee that standardized JSON in JS next to him). Surrogates show up as a way to express astral plane stuff in escape sequences, and UTF-16 OS strings usually allow them. So they are in the repertoire, but that's not meant to excuse invalid UTF-8, and no JSON spec does.
If I could channel Crockford on the issue of whether he would allow these now, I think he would say something like "Well, it took off because you could just use eval() at first. So, that worked." However, he's often surprising, so who knows.
From: Tim (Dec 23 2024, at 15:18)
Ed, you're right. I'm going to have to refine the state machine. It may be as simple as removing the start state mappings for C0 and C1, but I'll take a careful look.
From: Ed Davies (Dec 25 2024, at 10:30)
Here's my version of a state machine to recognize well-formed UTF-8 as per the Unicode 16 standard (i.e., a re-coding of their Table 3-7). It needs 4 extra states, not just your “ED” state, in order to fully deal with both surrogates and “stretched” encodings:
https://edavies.me.uk/paste/R2XVFXpjcIiVq9FkaZ2u7ZUC/utf8.svg
From: Tim (Dec 25 2024, at 14:28)
Ed, that looks right to me, so my next step is to build a unit test that explores the space of valid vs invalid sequences, then I'll rebuild the automaton, then I'll know if it worked.
This has been an education to me; and I thought I understood UTF-8 pretty fully prior to this. Thanks!
From: Robert Sayre (Dec 27 2024, at 12:20)
That pretty closely resembles Bjoern's state machine I've linked before. Have a look at his "Implementation details" writing there. I didn't mean to suggest you should use his C library, but that you might be able to adapt it to Go.