Quamina was for a time my favorite among all my software contributions. But then it stalled after I shipped 1.0 in January of 2023. First of all, I got busy with the expert witness for Uncle Sam gig and second, there was a horrible problem in there that I couldn’t fix. Except for now I have! And I haven’t done much codeblogging recently. So, here are notes on nondeterministic finite automata, epsilon transitions, Ken Thompson, Golang generics, and prettyprinting. If some subset of those things interests you, you’ll probably like this.

(Warning: if you’ve already had your hands on the theory and practice of finite automata, this may all be old hat.)

[Update: This is kind of embarrassing. It looks like what this post refers to as an “epsilon” is not the same epsilon that features in the theory of finite automata. I mean, it still works well for where I’m using it, but I obviously need to dig in harder and deeper.]

Sidebar: What’s a Quamina? · I don’t think there’s much to be gained by duplicating Quamina’s README but in brief: “A fast pattern-matching library in Go with a large and growing pattern vocabulary and no dependencies outside Go’s standard libraries.” If you want much, much more, this Quamina Diary blog series has it.

The problem · Combining too many patterns with wild-cards in them caused Quamina 1.0’s data structures to explode in size with a growth rate not far off the terrifying O(2^N), which meant that once you’d added much more than 20 patterns you couldn’t add any more, because the add-pattern code’s runtime was O(2^N) too.

Those structures are state machines generally, “nondeterministic finite automata” (NFAs) in particular. Which offer good solutions to many software problems, but when they get to be any size at all, are really hard to fit into a human mind. So when I was looking at Quamina’s unreasonably-big automata and trying to figure out how they got that way, my brain was screaming “Stop the pain!”

Lesson: Prettyprint! · At the point I stalled on Quamina, I’d started a refactor based on the theory that the NFAs were huge because of a failure to deduplicate state transitions. But the code I’d written based on that theory was utterly broken; it failed simple unit tests and I couldn’t see why.

During the months when I was ignoring the problem, I privately despaired because I wasn’t sure I could ever crack it, and I couldn’t stomach more struggling with ad-hoc Printf and debugger output. So I decided to generate human-readable renditions of my automata. Given that, if I still couldn’t figure out what was going on, I’d have to admit I wasn’t smart enough for this shit and walk away from the problem.

Which turned out to be a good call. Generating an information-dense but readable display was hard, and I decided to be ruthless about getting the spaces and punctuation in the right places. Because I didn’t want to walk away.

Back in the day, we used to call this “prettyprinting”.

It worked! First of all, my prettyprinter showed me that the automata emitted based on my deduplication theory were just wrong, and what was wrong about them, and I found that code and fixed it.

Bad news: My deduplication theory was also just wrong. Good news: My prettyprinter provided unavoidable proof of the wrongness and made me go back to first principles.

And I just landed a PR that cleanly removed the state explosion.

Free advice · I’ll show off the prettyprinter output below where I dig into the state-explosion fix. But for the moment, a recommendation: If you have a data structure that’s not Working As Intended and is hard to grok, go hide for a couple of days and write yourself a prettyprinter. Prettyprinting is an intelligence amplifier. Your Future Self will thank you heartily.
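
To make that concrete, here’s a minimal sketch of the idea in Go. The state type, its fields, and the labels are invented for this example, not Quamina’s internals; the point is the shape of the output, one state per line with every transition visible.

 package main
 
 import (
     "fmt"
     "strings"
 )
 
 // Hypothetical types for illustration only; Quamina's internals differ.
 type state struct {
     label       string
     transitions map[byte]*state
 }
 
 // prettyPrint walks the automaton once from start, emitting one line per
 // state with all of its transitions, so two automata can be compared by eye.
 func prettyPrint(start *state) string {
     var sb strings.Builder
     seen := map[*state]bool{}
     var walk func(*state)
     walk = func(s *state) {
         if seen[s] {
             return
         }
         seen[s] = true
         sb.WriteString(s.label)
         for on, next := range s.transitions {
             fmt.Fprintf(&sb, " '%c' → %s", on, next.label)
         }
         sb.WriteByte('\n')
         for _, next := range s.transitions {
             walk(next)
         }
     }
     walk(start)
     return sb.String()
 }
 
 func main() {
     onQuote := &state{label: "910[on \"]"}
     start := &state{label: "758[START HERE]",
         transitions: map[byte]*state{'"': onQuote}}
     fmt.Print(prettyPrint(start))
 }

A real prettyprinter would also want to sort the transitions so two runs over the same automaton produce identical text; that’s what makes eyeball-diffing possible, and it’s where the ruthlessness about spaces and punctuation pays off.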

“Back to first principles”? · The single best write-up on NFA and regex basics that I’ve ever encountered is Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...) by Russ Cox. It’s a discussion of, and reflection on, the regular expression library constructed by Ken Thompson in the mid-Sixties, before he got mixed up in Unix.

What’s annoying is that I had read this before I started wiring NFAs into Quamina, but ignored most of its important lessons due to a combination of not understanding them and thinking that my existing code could do what Cox described. A couple of weeks ago I went back and read it again, and it all made perfect sense and showed me the way forward. So I guess the lesson is that if you’re not Ken Thompson, you’re going to have trouble understanding what he did until you’ve tried and failed yourself?

So, major thanks to Ken for this (and Unix and other things too) and to Russ for the write-up.

Epsilon transitions · These are the magic bullet that makes NFAs work. Quamina didn’t have them; now it does. There are other bits and pieces but that’s the core of the thing.

I think the easiest way to explain is by showing you an NFA as displayed by Quamina’s new prettyprinter. It matches the regular expression "x.*9" — note that the " delimiters are part of the pattern:

 758[START HERE] '"' → 910[on "]
 910[on "] 'x' → 821[gS]
 821[gS] ε → 821[gS] / '9' → 551[gX on 9]
 551[gX on 9] '"' → 937[on "]
 937[on "] 'ℵ' → 820[last step]
 820[last step]  [1 transition(s)]

  • There’s an API to attach labels to states as you build automata, which as a side-effect gives each a random 3-digit number too. This is done in a way that can be turned into a no-op at production time.

  • 758: The start state; the only character that does anything is the opening " delimiter which transitions to state 910.

  • 910: You get here when you see the " and the only exit is if you see an x, which moves to 821.

  • 821: This state is the “glob” * operator. gS in its label stands for “glob spin”. It has an “epsilon” (ε) transition to itself. In Computer-Science theory, they claim that the epsilon transition can occur at any time, spontaneously, la-di-da. In programming practice, you take an epsilon transition for every input character. 821 also has an ordinary transition on 9 to state 551. (There’s a Go sketch of one way to represent this just after the list.)

    This possibility of having multiple transitions out of a state on the same input symbol, and the existence of epsilon transitions, are the defining characteristics that make NFAs “nondeterministic”.

  • 551: Its label includes gX for “glob exit”. The only transition is on the closing " delimiter, to 937.

  • 937 has only one transition, on ℵ (stands for the reserved value Quamina inserts to signal the end of input) to 820.

  • 820 doesn’t do anything, but the [1 transition(s)] label means that if you reach here you’ve matched this field’s value and can transition to working on the next field.
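
Here’s roughly what a structure like that can look like in Go. This is a sketch, not Quamina’s actual representation; the names (nfaState, epsilons, endOfInput) are invented for the example, but it’s enough to hand-build the "x.*9" automaton from the prettyprint.

 // nfaState is a sketch for illustration; Quamina's real data structures
 // differ. Ordinary transitions are keyed by input byte and may lead to
 // more than one state, and epsilons lists states reachable no matter
 // what the input byte is.
 type nfaState struct {
     label       string
     transitions map[byte][]*nfaState
     epsilons    []*nfaState
     lastStep    bool // reaching here means this field's value matched
 }
 
 // endOfInput stands in for the reserved value (printed as ℵ above) that
 // Quamina appends to signal the end of a field's value. The specific
 // byte is illustrative; 0xF5 can never appear in valid UTF-8.
 const endOfInput byte = 0xF5
 
 // buildXDotStar9 hand-builds the automaton prettyprinted above, matching
 // "x.*9" (the " delimiters are part of the pattern).
 func buildXDotStar9() *nfaState {
     last := &nfaState{label: "820[last step]", lastStep: true}
     on937 := &nfaState{label: "937[on \"]",
         transitions: map[byte][]*nfaState{endOfInput: {last}}}
     globExit := &nfaState{label: "551[gX on 9]",
         transitions: map[byte][]*nfaState{'"': {on937}}}
     globSpin := &nfaState{label: "821[gS]",
         transitions: map[byte][]*nfaState{'9': {globExit}}}
     globSpin.epsilons = []*nfaState{globSpin} // the ε self-loop: spin on anything
     on910 := &nfaState{label: "910[on \"]",
         transitions: map[byte][]*nfaState{'x': {globSpin}}}
     return &nfaState{label: "758[START HERE]",
         transitions: map[byte][]*nfaState{'"': {on910}}}
 }

Note that transitions maps a byte to a list of states rather than a single one; that, plus epsilons, is where the nondeterminism lives.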

Now I’m going to display the prettyprint again so you can look at it as you read the next paragraph.

 758[START HERE] '"' → 910[on "]
 910[on "] 'x' → 821[gS]
 821[gS] ε → 821[gS] / '9' → 551[gX on 9]
 551[gX on 9] '"' → 937[on "]
 937[on "] 'ℵ' → 820[last step]
 820[last step]  [1 transition(s)]

A little thought shows how the epsilon-transition magic works. Suppose the input string is "xyz909". The code will match the leading " then x and hit state 821. When it sees y and z, the only thing that happens is that the epsilon transition loops back to 821 every time. When it hits the first 9, it’ll advance to 551 but then stall out, because the following character is 0, which doesn’t match the only path forward, through ". But the epsilon transition keeps looping, and when the second 9 comes along it’ll proceed smoothly through 551, 937, and 820, signaling a match. Yay!

So now, I have a fuzz test which adds a pattern for each of about thirteen thousand 5-letter words, with one * embedded in each at a random offset, including the leading and trailing positions. The add-pattern code hardly slows down at all. The matching code slows down a lot, to below 10,000/second, in stark contrast to most Quamina instances, which can achieve millions of matches/second.

I’m sort of OK with this trade-off; after all, it’s matching 10K-plus patterns! I’m going to work on optimizing it, but I have to accept that the math, as in finite-automata theory, might be against me. But almost certainly there are some optimizations to be had. There are possibilities suggested by Cox’s description of Thompson’s methods. And the search for paths forward will likely be good blog fodder. Yay!
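
If you want to throw a similar pile of patterns at Quamina yourself, here’s the flavor of that test. This is a sketch that assumes Quamina’s public AddPattern and MatchesForEvent calls and its “shellstyle” pattern type; the field name “word” and the three-word list (standing in for the thirteen thousand) are invented for the example.

 package main
 
 import (
     "fmt"
     "math/rand"
 
     "github.com/timbray/quamina"
 )
 
 func main() {
     q, err := quamina.New()
     if err != nil {
         panic(err)
     }
     // A stand-in for the ~13,000 5-letter words.
     words := []string{"aargh", "abaca", "abaci"}
     for _, w := range words {
         // Embed one * at a random offset, including the leading and
         // trailing positions.
         at := rand.Intn(len(w) + 1)
         glob := w[:at] + "*" + w[at:]
         pattern := fmt.Sprintf(`{"word": [ {"shellstyle": "%s"} ]}`, glob)
         if err := q.AddPattern(w, pattern); err != nil {
             panic(err)
         }
     }
     matches, err := q.MatchesForEvent([]byte(`{"word": "abaca"}`))
     if err != nil {
         panic(err)
     }
     fmt.Println(len(matches), "patterns matched")
 }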

Ken again · When I re-read Russ Cox’s piece, I was looking at the pictures and narrative, mostly ignoring the C code. When everything was working, I went back and was irrationally thrilled that my bottom-level function for one state traversal had the same name as Ken Thompson’s: step().

Also, when you process an NFA, you can be in multiple states at once; see the "xyz909" example above. When you’re in multiple states and you process an input symbol, you might end up in zero, one, or many new states. Russ writes, of Ken Thompson’s code, “To avoid allocating on every iteration of the loop, match uses two preallocated lists l1 and l2 as clist and nlist, swapping the two after each step.”

Me too! Only mine are called currentStates and nextStates because it’s 2024.
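
It looks roughly like the sketch below, which reuses the hypothetical nfaState type and buildXDotStar9 from the earlier sketch; it illustrates the two-preallocated-list technique, not Quamina’s actual matching loop.

 // matches reports whether val drives the automaton from start to a
 // lastStep state. The two lists are allocated once and swapped after
 // each input byte, so the per-byte loop allocates nothing (unless
 // append has to grow a list).
 func matches(start *nfaState, val []byte) bool {
     currentStates := make([]*nfaState, 0, 16)
     nextStates := make([]*nfaState, 0, 16)
     currentStates = append(currentStates, start)
     for _, b := range val {
         nextStates = nextStates[:0]
         for _, s := range currentStates {
             // ordinary transitions on this byte …
             nextStates = append(nextStates, s.transitions[b]...)
             // … plus the epsilon transitions, taken on every input byte
             // in this scheme. A real implementation would deduplicate.
             nextStates = append(nextStates, s.epsilons...)
         }
         currentStates, nextStates = nextStates, currentStates // the swap
         if len(currentStates) == 0 {
             return false // no live states left; can't possibly match
         }
     }
     for _, s := range currentStates {
         if s.lastStep {
             return true
         }
     }
     return false
 }

Feed it buildXDotStar9() and the bytes of "xyz909" (delimiters included, with endOfInput appended) and it walks exactly the path described above and reports a match.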

And thereby hangs a blog or maybe more than one. Because traversing the NFA is at Quamina’s white-hot center. You really REALLY don’t want to be allocating memory in that code path. Which should be straightforward. But it’s not, for interesting reasons that raise optimization problems I’m just starting to think about, but you’ll probably hear all about it when I do.

Un-generic · In the process of moving Quamina from DFAs to mixed DFA/NFA to pure-NFA I adopted and then abandoned Go’s generics. They hate me. Or I’m not smart enough. Or something. I wrote about the experience back in 2022 and while that piece ended inconclusively, I am personally much happier with generics-free Go code. Maybe they make other people happy.

Hard to understand · And then finally, there’s this one function I wrote in June 2022, doesn’t matter what it does. It has a comment at the top that begins: “Spookeh. The idea is that…” and goes on for a long paragraph which, well, I can’t understand. Then I look at the code and think “that can’t work.” I keep thinking of sequences that should send it off the rails and write the unit tests, and they fail to fail, and I use the prettyprinter and the NFA it generates is ruthlessly correct. I go back and look at it every few days and end up shaking my head. This is making me grumpy.

But after all, I did write, in a previous Quamina Diary episode: “The observation that computer programmers can build executable abstractions that work but they then have trouble understanding is not new and not surprising. Lots of our code is smarter than we are.”

But I’ll figure it out. And it’s nice to have interesting computer-programming stuff to blog about.



Contributions


From: Keith Wansbrough (Jun 20 2024, at 08:01)

I love this, and totally agree on the importance of good visualizations.

You are using a different definition of "epsilon transition" than the one I think is standard: normally an epsilon transition doesn't consume a symbol. They're handy for things like representing alternation: M(A|B) can be built from M(A) and M(B) by adding a new start state and two epsilon transitions. The thing you're calling "epsilon" looks more like the good old "." operator (or character classes).


From: Zellyn Hunter (Jun 20 2024, at 08:19)

My weapon of choice nowadays is to print out graphviz text, and copy/paste it into https://viz-js.com/

Here's the start of some code I'm currently using to debug things:

```
connector = '->'
output = ['digraph {']
output += [' rankdir=LR;']
output += [' splines=false;']
if not digraph:
    output = ['graph {']
    connector = '--'
```

