Unbackslash

Old software joke: “After the apocalypse, all that’ll be left will be cockroaches, Keith Richards, and markup characters that have been escaped (or unescaped) one too many (or few) times.” I’m working on a programming problem where escaping is a major pain in the ass, specifically “\”. So, for reasons that seem good to me, I want to replace it. What with?

The problem · My Quamina project is all about matching patterns (not going into any further details here; I’ve written this one to death). Recently, I implemented a “wildcard” pattern that works just like a shell glob, so you can match things like *.xlsx or invoice-*.pdf. The only metacharacter is *, so it has basic escaping, just \* and \\.
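
To make the escaping rule concrete, here is a minimal sketch of a matcher where * is the only metacharacter and \* and \\ are the only escapes. It is not Quamina’s implementation; the names parseWildcard, matchTokens and matchWildcard are made up for illustration, and the simple backtracking is just the shortest thing that works.

```go
package main

import (
	"errors"
	"fmt"
)

// token is one unit of a parsed pattern: a literal byte or the "*" wildcard.
type token struct {
	star bool
	lit  byte
}

// parseWildcard splits a pattern into tokens, accepting only the escapes \* and \\.
func parseWildcard(pattern string) ([]token, error) {
	var toks []token
	for i := 0; i < len(pattern); i++ {
		switch c := pattern[i]; c {
		case '*':
			toks = append(toks, token{star: true})
		case '\\':
			i++ // look at the escaped character
			if i == len(pattern) || (pattern[i] != '*' && pattern[i] != '\\') {
				return nil, errors.New(`\ must be followed by * or \`)
			}
			toks = append(toks, token{lit: pattern[i]})
		default:
			toks = append(toks, token{lit: c})
		}
	}
	return toks, nil
}

// matchTokens reports whether toks matches all of s, using simple backtracking.
func matchTokens(toks []token, s string) bool {
	if len(toks) == 0 {
		return s == ""
	}
	if toks[0].star {
		// Let the star swallow 0, 1, 2, ... bytes and try the rest each time.
		for i := 0; i <= len(s); i++ {
			if matchTokens(toks[1:], s[i:]) {
				return true
			}
		}
		return false
	}
	return s != "" && s[0] == toks[0].lit && matchTokens(toks[1:], s[1:])
}

// matchWildcard parses the pattern and matches it against s.
func matchWildcard(pattern, s string) (bool, error) {
	toks, err := parseWildcard(pattern)
	if err != nil {
		return false, err
	}
	return matchTokens(toks, s), nil
}

func main() {
	// Raw string literals keep the Go layer of escaping out of the way.
	cases := [][2]string{
		{`invoice-*.pdf`, "invoice-2024.pdf"},
		{`a\*b`, "a*b"},
		{`a\*b`, "axb"},
	}
	for _, tc := range cases {
		ok, err := matchWildcard(tc[0], tc[1])
		fmt.Println(tc[0], tc[1], ok, err)
	}
}
```

Parsing into tokens first means the matcher never has to think about escaping at all, which is about the only pleasant part of this story.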

It wasn’t hard to write the code, but the unit tests were a freaking nightmare, because \. Specifically, because Quamina’s patterns are wrapped in JSON, which also uses \ for escaping, and I’m coding in Go, which does too, differently for strings delimited by " and `. In the worst case, to test whether \\ was handled properly, I’d have \\\\\\\\ in my test code.
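
Here is a small sketch of how the layers stack up, using only Go’s standard encoding/json. The helper name decodeJSONString is invented for this example. The goal is to hand the matcher the two-character pattern \\ (an escaped backslash); the same JSON text is spelled once as a Go interpreted literal and once as a raw literal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSONString strips the JSON layer of escaping from a JSON string literal.
func decodeJSONString(src string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(src), &s)
	return s, err
}

func main() {
	// Interpreted literal: Go halves the eight backslashes to four,
	// then JSON halves those four to two.
	interpreted := "\"\\\\\\\\\""
	// Raw literal: Go leaves the text alone, so only the JSON layer remains.
	raw := `"\\\\"`

	for _, src := range []string{interpreted, raw} {
		pattern, err := decodeJSONString(src)
		if err != nil {
			panic(err)
		}
		// pattern is what the pattern-matching code finally gets to see.
		fmt.Printf("JSON text %s decodes to pattern %s (%d bytes)\n", src, pattern, len(pattern))
	}
}
```

Both spellings produce the same JSON text, so both decode to the same two-backslash pattern. The raw literal helps, but only by removing one of the three layers.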

It got to the point that when a test was failing, I had to go into the debugger to figure out what eventually got passed to the library code I was working on. One of the cats jumped up on my keyboard while I was beset with \\\\ and found itself trying to tread air. (It was a short drop onto a soft carpet. But did I ever get glared at.)

Regular expressions ouch · That’s the Quamina feature I’ve just started working on. And as everyone knows, they use \ promiscuously. Dear Reader, I’m going to spare you the “Sickening Regexps I Have Known” war stories. I’m sure you have your own. And I bet they include lots of \’s.

(The particular dialect of regexps I’m writing is I-Regexp.)

I’ve never implemented a regular-expression processor myself, so I expect to find it a bit challenging. And I expect to have really a lot of unit tests. And the prospect of wrangling the \’s in those tests is making me nauseous.

I was telling myself to suck it up when a little voice in the back of my head piped up “But the people who use this library will be writing Go code to generate and test patterns that are JSON-wrapped, so they’re going to suffer just like you are now.”

Crazy idea · So I tried to adopt the worldview of a weary developer trying to unit-test their patterns and simultaneously fighting JSON and Go about what \\ might mean. And I thought “What if I used some other character for escaping in the regexp? One that didn’t have special meanings to multiple layers of software?”

“But that’s crazy,” said the other half of my brain. “Everyone has been writing things like \S+\.txt and [^{}[\]]+ for years and just thinks that way. Also, the Spanish Inquisition.”

Whatever; like Prince said, let’s go crazy.

The new backslash · We need something that’s visually distinctive, relatively unlikely to appear in common regular expressions, and not too hard for a programmer to enter. Here are some candidates, in no particular order.

For each, we’ll take a simple harmless regexp that matches a pair of parentheses containing no line breaks, like so:

Original: \([^\n\r)]*\)

And replace its \’s with the candidate to see what it looks like:

Left guillemet: « · This is commonly used as an opening quotation mark in languages other than English, French in particular. “Open quotation” has a good semantic feel; after all, \ sort of “quotes” the following character. It’s visually pretty distinctive. But it’s hard to type on keyboards not located in Europe. And the developers who do sit behind European keyboards are the ones most likely to want a literal « in a regexp. Hmm.

Sample: «([^«n«r)]*«)

Em dash: — · Speaking of characters used to begin quotes, Em dash seems visually identical to U+2015 HORIZONTAL BAR (also known as the quotation dash), which I’ve often seen as a quotation start in English-language fiction. Em dash is reasonably easy to type and unlikely to appear much in real-life regexps. Visually compelling.

Sample: —([^—n—r)]*—)

Left double quotation mark: “ · (AKA left smart quote.) So if we like something that suggests an opening quote, why not just use an opening quote? There’s a key combo to generate it on most people’s keyboards. It’s not that likely to appear in developers’ regular expressions. Visually strong enough?

Sample: “([^“n“r)]*“)

Pilcrow: ¶ · Usually used to mark a paragraph, so no semantic linkage. But it’s visually strong (maybe too strong?) and has key combos on many keyboards. Unlikely to appear in a regular expression.

Sample: ¶([^¶n¶r)]*¶)

Section sign: § · Once again, visually (maybe too) strong, accessible from many keyboards, not commonly found in regexps.

Sample: §([^§n§r)]*§)

Tilde: ~ · Why not? I’ve never seen one in a regexp.

Sample: ~([^~n~r)]*~)
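
The samples above can be reproduced mechanically. This is a throwaway sketch (withEscape is a made-up helper), and the blind substitution is only safe because the sample regexp contains neither an escaped backslash nor any of the candidate characters.

```go
package main

import (
	"fmt"
	"strings"
)

// withEscape rewrites a backslash-escaped regexp to use another escape character.
// Assumption: the regexp contains no \\ and no literal occurrence of the candidate.
func withEscape(re string, candidate rune) string {
	return strings.ReplaceAll(re, `\`, string(candidate))
}

func main() {
	original := `\([^\n\r)]*\)`
	// The six candidates, in the order they are discussed.
	for _, c := range []rune{'«', '—', '“', '¶', '§', '~'} {
		fmt.Printf("%c  %s\n", c, withEscape(original, c))
	}
}
```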

Escaping · Suppose we used tilde to replace backslash. We’d need a way to escape tilde when we wanted it to mean itself. I think just doubling the magic character works fine. So suppose you wanted to match anything beginning with . in my home directory: ~~timbray/~..* (that’s ~~ for a literal tilde, ~. for a literal dot, then .* for whatever follows).
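
A minimal sketch of the doubling rule, under loud assumptions: tilde is the chosen escape, and Go’s standard regexp package stands in for an I-Regexp engine, which it is not. The helpers fromTilde and matchesTilde are hypothetical; they translate a tilde-escaped pattern back into the conventional backslash form and then let the existing engine do the matching.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// fromTilde converts a tilde-escaped regexp into the usual backslash-escaped form.
func fromTilde(re string) (string, error) {
	var b strings.Builder
	rs := []rune(re)
	for i := 0; i < len(rs); i++ {
		switch r := rs[i]; r {
		case '~':
			i++ // look at the escaped character
			if i == len(rs) {
				return "", errors.New("dangling ~ at end of pattern")
			}
			if rs[i] == '~' {
				b.WriteRune('~') // ~~ means a literal tilde
			} else {
				b.WriteRune('\\') // ~x becomes \x
				b.WriteRune(rs[i])
			}
		case '\\':
			b.WriteString(`\\`) // backslash is now an ordinary character
		default:
			b.WriteRune(r)
		}
	}
	return b.String(), nil
}

// matchesTilde reports whether the whole of s matches the tilde-escaped pattern.
func matchesTilde(pattern, s string) (bool, error) {
	translated, err := fromTilde(pattern)
	if err != nil {
		return false, err
	}
	re, err := regexp.Compile("^(?:" + translated + ")$")
	if err != nil {
		return false, err
	}
	return re.MatchString(s), nil
}

func main() {
	// Literal tilde, "timbray/", literal dot, then anything.
	pattern := `~~timbray/~..*`
	for _, s := range []string{"~timbray/.bashrc", "~timbray/notes.txt"} {
		ok, err := matchesTilde(pattern, s)
		fmt.Println(s, ok, err)
	}
}
```

Translating up front keeps the change cosmetic: the regexp engine itself never needs to know that the escape character was swapped.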

“But wait,” you cry, “why are any of these better than \?” Because no other layer of software is fighting to interpret them as an escape. The escape character is all yours.
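
To put a number on that, here is a sketch that counts the backslashes a pattern picks up on its way into test code. It uses only the standard library (json.Marshal for the JSON layer, strconv.Quote for a Go interpreted string literal); backslashCounts is an invented helper, and tilde is assumed as the replacement escape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// backslashCounts reports how many backslashes a pattern needs once it is
// wrapped in a JSON string, and once that JSON sits in a Go interpreted literal.
// The Go count includes the two backslashes that escape the JSON quote marks.
func backslashCounts(pattern string) (inJSON, inGoLiteral int) {
	jsonText, err := json.Marshal(pattern)
	if err != nil {
		panic(err) // not expected when marshaling a plain string
	}
	goLiteral := strconv.Quote(string(jsonText))
	return strings.Count(string(jsonText), `\`), strings.Count(goLiteral, `\`)
}

func main() {
	for _, p := range []string{`\([^\n\r)]*\)`, `~([^~n~r)]*~)`} {
		j, g := backslashCounts(p)
		fmt.Printf("%-16s JSON: %2d backslashes, Go literal: %2d backslashes\n", p, j, g)
	}
}
```

The backslash version starts with four and doubles at each layer; the tilde version only ever pays for the quote marks around the JSON string.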

You can vote! · I’m going to run a series of polls on Mastodon. Get yourself an account anywhere in the Fediverse and follow the #unbackslash hashtag. Polls will occur on Friday September 27, in reasonable Pacific times. Of course, one of the options will be “Don’t do this crazy thing, stick with good ol’ \!”

Contributions

From: Dagon (Sep 25 2024, at 15:13)

I miss Perl for a lot of reasons (and I don't miss it for many of the same reasons). One of the big ones was quote operators.

Instead of "..." or '...', you could use qq(...) or q(...). The language was flexible in what delimiters were used, so q[...] or q^...^ were valid. This didn't fully solve the leaning-toothpick problem, but it removed one level of it, and made it much easier to keep things clean.

From: Pete Forman (Sep 25 2024, at 15:51)

In reply to Dagon

sed is similar. In the common s/foo/bar/ the delimiter can be any character, not just /.

From: Andrew Reilly (Sep 25 2024, at 19:08)

How about skipping quoting altogether by enlarging the character set? Change the syntax of your regexps to separate all tokens with a space, and spell out the special characters with words or short strings, such as the set in man (1) ascii? That way you can have ( to match a parenthesis and, say, m( to mark the start of a bracketed expression.

It's not as though regular-expression conciseness is a performance criterion.

I also quite like the expression matching in racket-lang.

From: Gavin B (Sep 26 2024, at 01:33)

Candidate: ¬ (Negation Sign)

A narrative could be:

Go along & down (¬)

to the next character then

do NOT (¬) take it literally.

https://www.ascii-code.com/CP1252/172

From: Ed Davies (Sep 26 2024, at 01:50)

Awkward and error-prone character manipulation? Perhaps we could add some functions to the programming language we have to hand:

sequence(literal("~timbray/"), zero_or_more(any_char()))

to build a sensible syntax tree then write it out with appropriate levels of escaping thereby avoiding the mistake-laden cognitive load of thinking about this convoluted syntax at the same time as we're thinking about what we want to use or test.

This could even be done in the JSON, though in a slightly more convoluted way. Perhaps easier in XML of course but maybe we could ask somebody involved in the definition of one or other of those.

From: Robert Sayre (Sep 26 2024, at 11:26)

The phrase is attributed to Dave Walker (@ffg). I don't think I'll forget that one.

September 22, 2024

By Tim Bray.
