Here’s a nice little RFC (RFC 7464) describing a nice little trick that might even be useful. Short form: People like to write JSON into logfiles. JSON text sequences make reading those logs easier and more robust.
The trick · You precede each JSON log entry with the little-known Unicode character U+001E INFORMATION SEPARATOR TWO, and you stick a newline (0x0A) after the entry.
This makes it easy for a log reader to pick the byte stream apart into chunks it can hand its friendly local JSON parser, and cleanly survive the not-terribly-uncommon scenario where something blew chunks while logging and left behind a truncated/malformed entry.
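A minimal sketch of the writer and reader sides in Python, assuming UTF-8 logs. The names `encode_entry` and `parse_sequence` are made up for illustration and come from neither the RFC nor any library. The reader splits on the RS byte, so a truncated entry only spoils its own chunk.

```python
import json

RS = b"\x1e"  # U+001E, the ASCII Record Separator
LF = b"\n"    # newline that closes each entry


def encode_entry(obj):
    """Serialize one log entry as RS + JSON text + LF (hypothetical helper)."""
    # json.dumps escapes control characters inside strings by default,
    # so a raw RS byte cannot appear inside the JSON text itself.
    return RS + json.dumps(obj).encode("utf-8") + LF


def parse_sequence(data):
    """Yield (True, value) for each good entry and (False, raw_bytes) for
    each chunk that fails to parse, instead of giving up on the whole log."""
    for chunk in data.split(RS):
        if not chunk.strip():
            continue  # nothing before the first RS, or two RS bytes in a row
        try:
            yield True, json.loads(chunk.decode("utf-8"))
        except ValueError:
            # Covers both bad UTF-8 and malformed or truncated JSON.
            yield False, chunk
```

Feeding it a log with a half-written entry in the middle should yield the good entries on either side plus one `(False, ...)` item for the damaged chunk. This sketch does not check for the trailing newline, which a stricter reader could use to catch an entry that was cut off but still happens to parse, such as a bare number.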
I’m not going to reproduce the RFC’s narrative; it’s perfectly transparent. I think this might actually find fairly widespread use. I’ll be showing it to some folks here at AWS; maybe someone will be interested. Boy, do we ever have a lot of logs.
Software archaeology · What makes this mildly amusing is that U+001E has a secret identity; it’s also an ASCII “control character” called RS, for Record Separator. I and every other text-encoding geek have long thought the control characters an irritating waste of space; XML 1.0 flatly forbade using them because they don’t mean anything and have no use.
Except that this one is being put to something like its original use, all these years later. Are there any ASCIInauts still living who’d know if there’s a story behind RS?
Kudos to Nico Williams for getting this done.
Comment feed for ongoing:
From: Caleb Ames (Feb 27 2015, at 07:35)
I don't know the story behind RS, but another place it still gets used is as a field separator in MARC, a library format for the interchange of bibliographic data. It's nice to see these obscure characters get trotted out on occasion.
From: John Cowan (Feb 27 2015, at 08:40)
The idea was to use FS, GS, RS, and US as hierarchical separators with depth up to four, so US and RS served the purposes of comma and newline in CSV format, for example. As far as I know, nobody ever actually did that. Note that Space comes immediately after US and can be thought of as an even lower-level separator.
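A minimal sketch of the two lowest levels of that hierarchy in Python, with US standing in for the comma and RS for the newline. The helper names `dump_table` and `load_table` are hypothetical, and the sketch assumes no field contains a US or RS character.

```python
US = "\x1f"  # Unit Separator: sits between fields, like a comma in CSV
RS = "\x1e"  # Record Separator: sits between records, like a newline in CSV


def dump_table(rows):
    """Join fields with US and records with RS (no quoting or escaping)."""
    return RS.join(US.join(fields) for fields in rows)


def load_table(text):
    """Inverse of dump_table for non-empty input."""
    if not text:
        return []
    return [record.split(US) for record in text.split(RS)]
```

Because commas and newlines are ordinary data here, a field like `"Smith, J."` or one containing a line break needs no quoting.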
Here's my take on ASCII from some years ago:
There’s no ASCII like US-ASCII,
It’s no ASCII I know;
Everything about it is U.S.-based,
Everything about it’s seven-bit.
You do that X-three-dot-four wheeler-dealing
When you’re feeling
Retro-fit.
There’s no ASCII like US-ASCII,
It’s Unicode’s first half-row;
Although it is a turkey that we know must die,
While systems chop off the bits that are high,
It is the only charset that will always fly,
ASCII, backward we go!
ASCII, on with the show!
From: Eric Fischer (Feb 27 2015, at 09:12)
RS and the other ASCII separators were reduced from an original block of 8 of them in ASCII-1963 as part of the process that moved ESC and ACK into the main block of control characters.
That block of 8 got into ASCII as part of Hugh McGregor Ross's rationalization of the controls into four blocks: switching controls, page format controls, information separators, and terminal controls, so that a 6-bit subset for a particular use could choose the block of 4 or 8 that made sense for its application to replace some of the punctuation. (In practice, nobody ever did this.)
It's unclear from the meeting minutes exactly how the separators became one of the four blocks, but it seems like they were basically an attempt to make US computer people believe that control characters were worth having at all. Communications people really wanted lots of control characters, which is why there are so many controls wasted on long-obsolete in-band switching protocols, and UK computer people wanted to have newline at least. US computer people were still strong believers in punch-card file formats of exactly 80 printing characters apiece and didn't want to waste any code points on non-printing characters they would never use. But apparently even they could see the utility of field separators, like the four special separators of OCR-A, so that part was theirs.
And then the Teletype Model 33 became the de facto standard implementation of ASCII, so anything it didn't respond to might as well not have existed, so most of the controls got pressed into alternate service as commands with letter mnemonics instead of ever being used for their intended function.
From: Nelson (Feb 27 2015, at 14:37)
It is a nice little RFC. I've been using 0x1c (file separator) as a way to do multipart uploads from JavaScript; sure beats trying to figure out MIME multipart encoding on a POST or whatever "the standard" is. I was afraid at first it'd break something, but it doesn't.
As a side benefit, 0x1c and 0x1e are so uncommon that you can probably get away with not escaping them, simplifying the encoding further. Well they used to be; maybe not anymore.
From: Ed Davies (Feb 27 2015, at 14:39)
In the past I've used lines consisting of just three HYPHEN-MINUS characters to separate JSON values in a log file. The idea came from a slight misunderstanding of a similar thing in YAML, which is a superset of JSON. I think I managed to convince myself that such a line wouldn't be plausible content in any sensible JSON string.
Sticking to printable characters has its attractions, though I expect most editors will display RS sensibly.
From: J. King (Feb 28 2015, at 20:04)
I used record separators for output of multiple logs to a single file or stdout in some software I wrote last year. Maybe not the most obviously friendly way to do it (multi-part MIME might have been clearer for a human to scan, say), but it was simple to implement and is easy to parse. Why complicate things?
From: Jörg Prante (Mar 01 2015, at 07:08)
The "information separator" characters RS, GS, FS, and US were first standardized in the early 1960s by ECMA, e.g.
http://www.ecma-international.org/publications/files/ECMA-ST-WITHDRAWN/Ecma%20philosophy%20on%20codes.pdf
and the 7-bit character code ECMA-7
http://www.ecma-international.org/publications/files/ECMA-ST-WITHDRAWN/ECMA-7,%201st%20Edition,%20April%201965.pdf
and this was later adapted for ASCII-1967.
The codes also appeared in EBCDIC when the IBM/360 was released in 1964.
The reason for these control codes was the new requirements for structured data on magnetic tape.
An example of usage is given here
http://www.worldpowersystems.com/J/codes/index.html#MESSAGE
Libraries had success for over 40 years with information separators in MARC and are nowadays deprecating this format in favor of RDF-based graphs, serialized to RDF/XML, Turtle, or JSON-LD.
I think it would make sense to use all the available information separators for parsing JSON instead of only one, in particular, when parsing JSON-LD.
From: Nelson (Mar 04 2015, at 11:40)
One more resource that is of historical interest: "The Evolution of Character Codes, 1874-1968" by Eric Fischer. A very detailed and well referenced paper looking at how ASCII came to be, starting with early telegraph encodings. It doesn't specifically talk about separators but is certainly of general interest.
http://trafficways.org/ascii/ascii.pdf