This was a long weekend in Canada; since I’m unemployed and have no workaday cares, I should have plenty of time to do family stuff. And I did. But I also got interested in a small programming problem and, over the course of the weekend, built a tiny tool called topfew. It does a thing you can already do, only faster, which is what I wanted. But I remain puzzled.
[To those who are interested in Amazon or politics or photography, feel free to skip this. I’ll come back to those things.]
What it does · Back in 2016 I tweeted Hello "sort | uniq -c | sort -rn | head" my old friend, and got a few smiles. If you feed that incantation a large number of lines of text it produces a short list showing which among them appear most often, and how many times each of those appears.
It’s a beautiful expression of the old Unix craft of building simple tools that you can combine to produce surprising and useful results. But when the data gets big, it can also be damn slow. I have a little script on the ongoing Web server that uses this technique to tell me which of my pieces are most popular, and how many times each of those was read.
In the last couple of weeks, my logfiles have gotten really big, and so my script has gotten really slow. As in, it takes the best part of a minute to run and, while it’s running, it sucks the life out of that poor little machine.
It turns out to be a pretty common problem. Anyone who’s operating an online service finds themselves processing logfiles to diagnose problems or weirdness, and you regularly need to answer questions like “who are the top requestors?” and “what are the most common API calls?”
What it is ·
Topfew is a simple-minded little Go program, less than 500 lines of code and almost half of that tests. If you have an Apache server log named access_log, and you wanted to find out which twelve IP addresses were hitting your server hardest, you could say:
awk '{print $1}' access_log | sort | uniq -c | sort -rn | head -12
With topfew, you’d say:
tf -fields 1 -few 12 access_log
(“tf” is short for “topfew”.)
The two would produce almost the same output, except that topfew would be faster. How much faster? Well, that’s complicated (see below)… but really quite a bit.
I don’t think there’s anything terribly clever about it; my bet is that if you asked a dozen senior developers to write this, they’d all end up with nearly identical approaches, and they’d all be pretty fast. Anyhow, as the links here make obvious, it’s on GitHub. If anyone out there needs such a thing and hasn’t already written their own as a weekend project, feel free to use and improve it.
Since I haven’t done a Go project from scratch in a long time, I’m not only out of practice but out of touch with current best practices, so it’s probably egregiously unidiomatic in lots of ways. Oh well.
How fast? ·
Well, that’s interesting. I ran a few tests on my Mac and it was many, many times faster than the sort/uniq incantation, so I smiled. I thought “Let’s try bigger data!” and built a Linux version (saying GOOS=linux go build still feels like living in the future to me), hopped over to my Web server, and ran exactly those queries from the last section on a recent Apache logfile (8,699,657 lines, 2.18G). Topfew was faster, but not many times faster like on my Mac. So I ran it on the Mac too. Here’s a summary of the results. All the times are in seconds.
|  | cp time | sort/uniq time | topfew time | speedup net of I/O |
|---|---|---|---|---|
| Linux | 17.29 | 37.66 | 26.75 | 2.15x |
| Mac | 1.40 | 137.46 | 8.81 | 18.36x |
The first column (“cp time”) is the number of seconds it takes to copy the big file from one directory to another. The last column is produced by subtracting the “cp” time from each of the other two, leaving something like the amount of extra latency introduced by the computing in excess of the time spent just copying data, then comparing the results.
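For example, on Linux that works out to (37.66 − 17.29) / (26.75 − 17.29) = 20.37 / 9.46 ≈ 2.15x, and on the Mac to (137.46 − 1.40) / (8.81 − 1.40) = 136.06 / 7.41 ≈ 18.36x.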
Anyhow, those are some puzzling numbers. Stray thoughts and observations that I’m not going to dignify with the term “hypotheses”:
1. The Mac has a damn fast SSD.
2. The Linux box’s disk is sluggish.
3. Perhaps the basic tools like sort and uniq are surprisingly fast on Linux, or the Go runtime is surprisingly slow. Or, you know, the mirror image of that statement.
4. The sort/uniq incantation is inherently parallel while topfew is 100% single-threaded. Maybe the Mac is dumb about scheduling processes in a shell pipeline across its eight cores?
Giggling · I broke out the Go profiler (Go is batteries-included in ways that other popular languages just aren’t) and ran that command against the Apache logfile again. I was poking around the output and Lauren said “I hope you’re laughing at the computer, not at me talking to the cat.”
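(If you haven’t played with it, here’s a minimal sketch of one way to wire CPU profiling into a command-line Go program with the standard runtime/pprof package. The -cpuprofile flag name is just for illustration, not topfew’s actual interface.)

```go
// Minimal sketch: optionally write a CPU profile, then do the real work.
package main

import (
	"flag"
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	cpuprofile := flag.String("cpuprofile", "", "write CPU profile to this file")
	flag.Parse()

	if *cpuprofile != "" {
		f, err := os.Create(*cpuprofile)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		if err := pprof.StartCPUProfile(f); err != nil {
			log.Fatal(err)
		}
		defer pprof.StopCPUProfile()
	}

	// ... do the actual work here, then inspect the results with:
	//   go tool pprof tf cpu.prof
}
```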
The profiler claimed that 73% of the time was spent in syscall.syscall reading the input data, and the first trace of actual application code was 2.27% of the time being spent in the regexp library doing field splitting (on \s+). The core code that actually accumulates counts for keys was consuming 1.54% of the elapsed time, and 93% of that was updating the total in a map[string]uint64.
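For the record, the shape of that hot path is roughly the following. This is just a sketch of the obvious approach (reading stdin and counting field 1), not the actual topfew code:

```go
// Sketch of the obvious approach: split each line on \s+, bump a counter
// in a map[string]uint64 for the chosen field, then print the top twelve.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

var fieldSep = regexp.MustCompile(`\s+`)

func main() {
	counts := make(map[string]uint64)
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		fields := fieldSep.Split(scanner.Text(), -1)
		if len(fields) > 0 {
			counts[fields[0]]++ // the map update that dominates the "real" work
		}
	}

	// Sort keys by descending count and report the top few.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return counts[keys[i]] > counts[keys[j]] })
	for i := 0; i < 12 && i < len(keys); i++ {
		fmt.Printf("%d %s\n", counts[keys[i]], keys[i])
	}
}
```

Nearly all the interesting work is that one map-update line, which is consistent with the profile showing it as a tiny slice of the total.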
Next steps? · I’m probably curious enough to poke around a little more. And maybe someone reading this will kindly explain to me in words of one syllable what’s going on here. But anyhow, the tool is useful enough as it is to save me a noticeable amount of time.
Comment feed for ongoing:
From: Christian Winkler (May 19 2020, at 01:53)
I am using "sort|uniq -c|sort -nr" all the time; a faster and more intuitive way would be excellent!
Is the Linux file in the page cache? You can enforce that via "vmtouch -tv $FILENAME".
Have you recorded the total time? What is the user CPU time and how is the IO load?
From: Fazal Majid (May 19 2020, at 04:31)
There are fast streaming algorithms for the TopN problem in constant memory, and implementations like the PostgreSQL TopN extension:
https://github.com/citusdata/postgresql-topn/
The first one was the Misra-Gries algorithm, but there have been improvements since:
https://arxiv.org/abs/1604.01357
These algorithms, along with related ones like HyperLogLog (estimates the number of unique items in a stream) or Counting Quotient Filter (best known alternative to a Bloom Filter) have made dramatic improvements to how huge data sets can be analyzed.
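Roughly, the Misra-Gries idea looks like this (just a sketch of the algorithm, not the PostgreSQL extension’s code): with k−1 counters it finds every item that occurs more than n/k times in a stream, in constant memory.

```go
// Misra-Gries sketch: keep at most k-1 candidate counters; when full,
// decrement every counter and drop the ones that reach zero.
package main

import "fmt"

func misraGries(stream []string, k int) map[string]int {
	counters := make(map[string]int, k)
	for _, item := range stream {
		if _, ok := counters[item]; ok {
			counters[item]++
		} else if len(counters) < k-1 {
			counters[item] = 1
		} else {
			for key := range counters {
				counters[key]--
				if counters[key] == 0 {
					delete(counters, key)
				}
			}
		}
	}
	return counters // candidates; a second pass over the data gives exact counts
}

func main() {
	stream := []string{"a", "b", "a", "c", "a", "b", "a", "d", "a"}
	fmt.Println(misraGries(stream, 3))
}
```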
From: Dirkjan Ochtman (May 19 2020, at 05:24)
Just because I was interested in how Go compares to Rust in this kind of scenario, I reimplemented your tool in Rust: https://github.com/djc/topfew-rs. The code is highly similar to yours, although I guess I might've sneaked in a few minor optimizations because Rust makes you think more about memory allocation.
On my (spinning HD) Linux server, I get 2.86x performance out of my release build Rust code vs your binary (naively built with "go build").
SLOC: Rust 145, Go 181.
This was a fun exercise.
From: Meower68 (May 19 2020, at 08:01)
When I've needed to do things like this, on a running server machine, I usually end up throwing a "nice -n 5" or something in front of it, such that Linux will deprioritize it and not slow down the main processes on the machine. In my experience, it slows it down only slightly, but causes no perceptible hit in performance on anything that truly matters.
I do a LOT of grep, sort and awk on large logfiles, frequently in excess of 1 GB in size. If I have something which will take a while to run, I frequently append a " | mail <address>" on the end, such that the output gets emailed to me, instead of just piping it to less for viewing and searching. Naturally, that needs to be installed but the vast majority of machines I've worked with had it. I've had processes which ran for over 4 hours (including downloading and unzipping over 200 GB of logs), with the end result being distilled to less than 100 KB. The results just popped up in my inbox when it was done. If the output comes out in some CSV format, it's not hard to script something which will email me with the resulting file attached. Not very interactive but ... if you REALLY need to know the results, you KNOW it's going to take a while and you REALLY need to work on something else in the meantime, it gets the job done.
You are correct about pipe using multiple cores. I learned that on a Linux 2.2 system with dual Pentium Pro 180s, back in college (and if you don't remember what a PPro was ... get off my lawn). The only way some monolithic tool is going to keep up with that is if it's written explicitly multi-threaded AND, even then, it depends on how well your platform implements that multi-threading. Not all multi-threading libraries are equal on all platforms. If all you have is a single-core machine, a monolithic tool might be the better answer. But if you're multi-core ... it's REALLY hard to beat simple Unix tools and pipe.
From: Pavel (May 19 2020, at 22:34)
In hypothesis 4, you’re saying ‘sort | uniq’ is somehow taking advantage of parallel processing. But in this case, before data arrives at uniq, doesn’t it need to be fully processed by sort? In fact, each sort in that pipeline acts like a parallelism buster, no?
From: Marcel Weiher (May 22 2020, at 07:59)
In my experience, the locale settings can have order(s)-of-magnitude scale impact on the performance of command line tools on macOS.
And the defaults are all set to 'slow'.
Another potential issue may be that some utilities on macOS have buffer sizes that are way, way too small to take advantage of their insanely fast SSDs.
For example, mkfile is compiled with a 512 byte buffer and achieves only 250MB/s out of the box, whereas recompiling with a 1MB buffer maxes out my SSD at ~2GB/s.
https://blog.metaobject.com/2017/02/mkfile8-is-severely-syscall-limited-on.html
Of course, you just have a pipeline that reads a large file and only returns a tiny amount of data, but the man page says that sort(1) might be hitting the disk by itself when it has to handle a large data set.
From: Seth.L (May 23 2020, at 06:53)
I don't think "cp time" is the right baseline. That's reading AND writing the file, while everything you're comparing to is just reads. Large file + spinning disk will make the Linux numbers especially bad.
The right comparison would be "cat time".
From: Lukas (Aug 04 2020, at 03:52)
Regarding golang best practices: it would be nice if you could provide release builds for your software. https://goreleaser.com/intro/ would be a recommendable start.