The murderer I emailed with is still in prison. And the software that got him pissed off at me still runs, so I ran it. Now here I am to pass on the history and then go all geeky. Here’s the tell: If you don’t know what a “filesystem” is (that’s perfectly OK, few reasonable adults need to) you might want to stay for the murderer story then step off the train.
Filesystems are one of the pieces of software that computers need to run, where “computers” includes your phone and laptop and each of the millions of servers that drive the Internet and populate the cloud. There are many flavors of filesystem and people who care about them care a lot.
One of the differences between filesystems is how fast they are. This matters because how fast the apps you use run depends (partly) on how fast the underlying filesystems are.
Writing filesystem software is very, very difficult and people who have done this earn immense respect from their peers. So, a lot of people try. One of the people who succeeded was named Hans Reiser and for a while his “ReiserFS” filesystem was heavily used on many of those “Linux” servers out there on the Internet that do things for you.
Reiser at one point worked in Russia and used a “mail-order bride” operation to look for a spouse. He ended up marrying Nina Sharanova, one of the bride-brokerage translators, and bringing her back to the US with him. They had two kids, got divorced, and then, on September 3, 2006, he strangled her and buried her in a hidden location.
To make a long story short, he eventually pleaded guilty to a reduced charge in exchange for revealing the grave location, and remains in prison. I haven’t provided any links because it’s a sad, tawdry story, but if you want to know the details the Internet has them.
I had interacted with Reiser a few times as a consequence of having written a piece of filesystem-related software called “Bonnie” (more on Bonnie below). I can’t say he was obviously murderous but I found him unpleasant to deal with.
As you might imagine, people generally did not want to keep using the murderer’s filesystem software, but it takes a long time to make this kind of infrastructure change and just last month, ReiserFS was removed as a Linux option. Which led to this Mastodon exchange:
(People who don’t care about filesystems can stop reading now.)
Now, numbers · After that conversation, on a whim I tracked down the Bonnie source and ran it on my current laptop, a 2023 M2 MacBook Pro with 32G of RAM and 3T of disk. I think the numbers are interesting in and of themselves even before I start discoursing about benchmarking and filesystems and disks and so on.
              -------Sequential Output--------- ---Sequential Input--- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    GB M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU  /sec %CPU
MBP-M2-32G 64  56.9 99.3  3719 89.0  2772 83.4  59.7 99.7  6132 88.0 33613 33.6
Bonnie says:
This puppy can write 3.7 GB/sec to a file, and read it back at 6.1 GB/sec.
It can update a file in place at 2.8 GB/sec.
It can seek around randomly in a 64GB file at 33K seeks/second.
Single-threaded sequential file I/O is almost but not quite CPU-limited.
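For anyone checking the arithmetic, those claims are just the table row with the units converted. A minimal sketch, assuming Bonnie's "M" is divided by a decimal 1000 to get "G" (that is the choice that reproduces the rounded figures quoted); the dictionary keys are made-up labels, not Bonnie field names.

```python
# Block-mode figures copied from the MBP-M2-32G row, in M/sec.
ROW_M_PER_SEC = {
    "block_write": 3719,
    "rewrite": 2772,
    "block_read": 6132,
}


def to_gb_per_sec(m_per_sec):
    """Convert M/sec to G/sec, assuming 1 G = 1000 M, rounded to one decimal."""
    return round(m_per_sec / 1000, 1)


if __name__ == "__main__":
    for name, value in ROW_M_PER_SEC.items():
        print(f"{name}: {to_gb_per_sec(value)} GB/sec")
```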
I wonder: Are those good numbers for a personal computer in 2024? I genuinely have no idea.
Bonnie · I will shorten the story, because it’s long. In 1988 I was an employee of the University of Waterloo, working on the New Oxford English Dictionary Project. The computers we were using typically had 16MB or so of memory (so the computer I’m typing this on has two thousand times as much) and the full text of the OED occupied 572MB. Thus, we cared really a lot about I/O performance. Since the project was shopping for disks and computers I bashed out Bonnie in a couple of afternoons.
I revised it lots over the years, and Russell Coker made an excellent fork called Bonnie++ that (for a while at least) was more popular than Bonnie. Then I made my own major revision at some point called Bonnie-64.
In 1996, Linus Torvalds recommended Bonnie, calling it a “reasonable disk performance benchmark”.
That’s all I’m going to say here. If for some weird reason you want to know more, Bonnie’s quaint Nineties-flavor home and description pages are still there, plus this blog has documented Bonnie’s twisty history quite thoroughly. And explored, I claim, filesystem-performance issues in a useful way.
I will address a couple of questions here, though.
Do filesystems matter? · Many performance-sensitive applications go to a lot of work to avoid reading and/or writing filesystem data on their critical path. There are lots of ways to accomplish this, the most common being to stuff everything into memory using Redis or Memcached or, well, those two dominate the market, near as I can tell. Another approach is to have the data in a file but access it with mmap rather than filesystem logic. Finally, since real disk hardware reads and writes data in fixed-size blocks, you could arrange for your code to talk straight to the disk, entirely bypassing filesystems. I’ve never seen this done myself, but have heard tales of major commercial databases doing so.
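To make the mmap option concrete, here is a minimal Python sketch. The helper names and the throwaway demo file are hypothetical, and this is not taken from any particular product. The data still lives in a file, but once it is mapped the program reads it with ordinary memory slicing and lets the kernel page it in, rather than issuing explicit read calls.

```python
import mmap
import os
import tempfile


def read_slice_with_mmap(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; note that mapping an empty file fails.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def demo():
    """Write a small temporary file, read part of it back via mmap, clean up."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello filesystem world")
    try:
        return read_slice_with_mmap(tmp.name, 6, 10)
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    print(demo())
```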
I wonder if anyone has ever done a serious survey study of how the most popular high-performance data repositories, including Relational, NoSQL, object stores, and messaging systems, actually persist the bytes on disk when they have to?
I have an opinion, based on intuition and on having seen the non-public inside of several huge high-performance systems at previous employers, that, yes, filesystem performance still matters. I’ve no way to prove or even publicly support that intuition. But my bet is that benchmarks like Bonnie are still relevant.
I bet a few of the kind of people who read this blog similarly have intuitions which, however, might be entirely different than mine. I’d like to hear them.
What’s a “disk”? · There is a wide range of hardware and software constructs which are accessed through filesystem semantics. They have wildly different performance envelopes. If I didn’t have so many other hobbies and projects, it’d be fun to run Bonnie on a sample of EC2 instance types, with the test files on various EBS, EFS, and similar configurations.
For the vast majority of CPU/storage operations in the cloud, there’s at least one network hop involved. Out there in the real world, there is still really a lot of NFS in production. None of these things are much like that little SSD slab in my laptop. Hmmm.
Today’s benchmarks · I researched whether some great-great-grandchild of Bonnie was the new hotness in filesystem benchmarking, adopting the methodology of typing “filesystem benchmark” into Web search. The results were disappointing; it doesn’t seem like this is a thing people do a lot. Which would suggest that people don’t care about filesystem performance that much? Which I don’t believe. Puzzling.
Whenever there was a list of benchmarks you might look at, Bonnie and Bonnie++ were on that list. Looks to me like IOZone gets the most ink and is thus probably the “industry-leading” benchmark. But I didn’t really turn up any examples of quality research comparing benchmarks in terms of how useful the results are.
Those Bonnie numbers · The biggest problem in benchmarking filesystem I/O is that Linux tries really hard to avoid doing it, aggressively using any spare memory as a filesystem cache. This is why serving static Web traffic out of the filesystem often remains a good idea in 2024; your server will take care of caching the most heavily fetched data in RAM without you having to do cache management, which everyone knows is hard.
I have read of various cache-busting strategies and have never really been convinced that they’ll outsmart this aspect of Linux, which was written by people who are way smarter and know way more than I think I do. So Bonnie has always used a brute-force approach: Work on a test file which is much bigger than main memory, so Linux has to do at least some real I/O. Ideally you’d like it to be several times the memory size.
But this has a nasty downside. The computer I’m typing on has 32GB of memory, so I ran Bonnie with a 64G filesize (128G would have been better) and it took 35 minutes to finish. I really don’t see any way around this annoyance but I guess it’s not a fatal problem.
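The reasoning behind "several times the memory size" can be put as simple arithmetic. This is a rough sketch under a deliberately pessimistic assumption, namely that the operating system could devote all of RAM to caching the test file; the function names are invented for illustration and are not part of Bonnie.

```python
def file_size_for_ram(ram_gb, multiple=2):
    """Test-file size in GB: a chosen multiple of main memory."""
    return ram_gb * multiple


def max_cached_fraction(ram_gb, file_gb):
    """Upper bound on the share of the test file that could sit in the cache,
    assuming (pessimistically) that all of RAM is available for caching."""
    return min(1.0, ram_gb / file_gb)


if __name__ == "__main__":
    ram_gb = 32  # the laptop described above
    for multiple in (2, 4):
        size = file_size_for_ram(ram_gb, multiple)
        share = max_cached_fraction(ram_gb, size)
        print(f"{size}G file: at most {share:.0%} could be served from cache")
```

With 32GB of RAM, a 64G file leaves at most half the data cacheable and a 128G file at most a quarter, which is why the bigger file gives more trustworthy numbers at the price of a longer run.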
Oh, and those numbers: Some of them look remarkably big to me. But I’m an old guy with memories of how we had to move the bits back and forth individually back in the day, with electrically-grounded tweezers.
Reiser again · I can’t remember when this was, but some important organization was doing an evaluation of filesystems for inclusion in a big contract or standard or something, and so they benchmarked a bunch, including ReiserFS. Bonnie was one of the benchmarks.
Bonnie investigates the rate at which programs can seek around in a file by forking off three child processes that do a bunch of random seeks, read blocks, and occasionally dirty them and write them back. You can see how this could be stressful for filesystem code, and indeed, it occasionally made ReiserFS misbehave, which was noted by the organization doing the benchmarking.
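The shape of that seek test can be sketched in Python. This is an illustration of the idea, not Bonnie's actual C code: the 8192-byte block size, the one-in-ten rewrite ratio, the seek counts, and all the function names are assumptions, and it needs a POSIX system because it uses fork.

```python
import os
import random
import tempfile
import time

BLOCK = 8192  # assumed block size, in bytes


def seeker(path, n_seeks, n_blocks, seed):
    """Seek to random blocks, read each one, and rewrite roughly one in ten."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDWR)
    try:
        for i in range(n_seeks):
            offset = rng.randrange(n_blocks) * BLOCK
            os.lseek(fd, offset, os.SEEK_SET)
            block = os.read(fd, BLOCK)
            if i % 10 == 0:
                # Dirty the block by flipping its first byte, then write it back.
                dirtied = bytes([block[0] ^ 0xFF]) + block[1:]
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, dirtied)
    finally:
        os.close(fd)


def run_seek_test(path, children=3, seeks_per_child=1000):
    """Fork `children` seekers against one file; return total seeks per second."""
    n_blocks = os.path.getsize(path) // BLOCK
    start = time.perf_counter()
    pids = []
    for child in range(children):
        pid = os.fork()
        if pid == 0:  # in the child process
            status = 1
            try:
                seeker(path, seeks_per_child, n_blocks, seed=child)
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code
        pids.append(pid)
    statuses = [os.waitpid(pid, 0)[1] for pid in pids]
    elapsed = time.perf_counter() - start
    if any(statuses):
        raise RuntimeError("a seeker process failed")
    return children * seeks_per_child / elapsed


def demo(n_blocks=64):
    """Run against a tiny temp file; this exercises the cache, not the disk."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(n_blocks * BLOCK))
    try:
        rate = run_seek_test(tmp.name, children=3, seeks_per_child=200)
        return rate, os.path.getsize(tmp.name)
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    rate, size = demo()
    print(f"{rate:.0f} seeks/sec on a {size}-byte file")
```

Several processes reading and rewriting blocks of the same file at once is exactly the kind of concurrent access a filesystem has to get right, which is where the stress comes from.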
Pretty soon I had email from Reiser claiming that what Bonnie was doing was actually violating the contract specified for the filesystem API in terms of concurrent write access. Maybe he was right? I can’t remember how the conversation went, but he annoyed me and in the end I don’t think I changed any code.
Here’s Bonnie · At one time Bonnie was on SourceForge, then Google Code, but I decided that if I were going to invest effort in writing this blog, it should be on GitHub too, so here it is. I even filed a couple of bugs against it.
I make no apologies for the rustic style of the code; it was another millennium and I was just a kid.
I cheerfully admit that I felt a warm glow checking in code originally authored 36 years ago.
Comment feed for ongoing:
From: Charlie Sauer (Dec 04 2024, at 11:41)
I've used Bonnie this year! As you suggest, there isn't an obvious successor. I used it in 2021 to try to assess SVR4 running on 86Box: https://notes.technologists.com/notes/2021/01/19/koko-dell-unix-sustainable/. Every now and then I try to pick up what I was doing with https://notes.technologists.com/notes/2022/01/10/koko-misp-2022/. That is what led me to use Bonnie earlier this year. I'm slowly adding to my SPEC89 spreadsheet, and when I do, I add columns with Bonnie results.
From: Fazal Majid (Dec 04 2024, at 11:58)
I use Bonnie++, it’s more pleasant than iozone. But nowadays disks are so fast and filesystems good enough the bottleneck is usually naive application software.
What I care for a lot more nowadays is correctness. Many filesystems will not actually flush to disk when you fsync, trading performance for the risk of data loss. What’s worse, many disks will also lie about their write caches being flushed to NAND Flash or spinning rust. In a database application, this can be a prescription for disaster. That’s why I use ZFS when it matters.
From: Jacek Kopecky (Dec 04 2024, at 12:15)
Perhaps memory compression helps the Mac get such huge I/O block numbers - how much randomness does Bonnie put in the 64GB file? Maybe there's even disk compression...
From: dpf (Dec 04 2024, at 16:04)
> I wonder if anyone has ever done a serious survey study of how the most popular high-performance data repositories, including Relational, NoSQL, object stores, and messaging systems, actually persist the bytes on disk when they have to?
in a previous life working on low-level storage, yes. at least at the level of talking to the storage devices we looked at traces, and they were mostly horrible because they didn't want to lose data but also didn't want to wait around for 'no' reason.
From: Tim (but not THE Tim) (Dec 04 2024, at 20:08)
I just wanted to say that I really appreciated, and related to, the line "But I’m an old guy with memories of how we had to move the bits back and forth individually back in the day, with electrically-grounded tweezers"
From: Jonathan Buys (Dec 05 2024, at 02:02)
Last I remember, SSDs have a fixed number of writes they can do in their life. I’d be hesitant to use any file system performance test that writes over and over to it, if the test is killing my drive a little more every time. Maybe that’s not the case anymore? It’s been a while since I’ve looked into it.
From: Doug K (Dec 05 2024, at 15:10)
good stories ;-)
one of the products I support is a high-speed (for values of) message broker, which has to persist messages to filesystem when the consumers slow down. Since the clouds came down, 'what is a disk' is a fine question. Often and often, our broker gets kneecapped by slow file i/o to a file system that is many hops and protocols away.. debugging that is just a joy.