I’ve been working on, and writing about, running Clojure Wide Finder code. But I was never satisfied with the absolute performance numbers. This is a write-up in some detail as to how I made the code faster and also slower, including lessons that might be useful to those working on Clojure specifically, concurrency more generally, and with some interesting data on Java garbage collection and JDK7.
[Update: My relationship with the JVM is improving, based on good advice from the Net. This article has been partly re-written and is much more cheerful.]
[This is part of the Concur.next series and also the Wide Finder 2 series.]
In a comment on my last piece, Alexy Khrabrov noted “I saw WF2 has a pretty good Scala result, apparently better than Clojure's — and that's a year-old Scala.” Alexy is right; my best times were mostly over 30 minutes, while Ray Waldin’s Scala code ran in under 15. There’s no obvious reason I can see why Clojure should be significantly slower than Scala, and while I stand by my Eleven Theses on Clojure, I was puzzled and irritated.
The following are organized roughly chronologically; if the narrative isn’t all that coherent, that would be an indicator of me having done considerable thrashing about.
Low-Hanging Fruit ·
The information you can get from the JVM with the
-Xprof
argument is not brilliant, but it’s a whole lot better
than nothing. What you want to see in profiling output is a big fat obvious
culprit, and I did, and it was the code fragment below, which breaks up
buffers into lines; text
is the text we’re wrangling,
first-nl
and last-nl
are the first and last places
where a newline character(whose value, note, is
10) appears in the block. destination
is the user-provided
payload function.
(loop [ nl first-nl ]
(let [ from (inc nl) to (. text indexOf 10 from) ]
(if (< from last-nl)
(do
(destination (new String (. text substring from to)) accum user-data)
(recur to)))))))
This sucker was burning well over half of all the CPU. There are a few things wrong with it. First of all, since I was already using Java to split up the text, why not do it all at once like this?
(let [chunks (. #"\n" split text)
Yep, that helped. (I’d actually had it that way in a previous iteration but had unrolled it while fighting a memory leak that turned out to be unrelated).
Lesson: Any time you can package up a bunch of work as a single call into Javaland, that’s probably a good idea.
Reducing ·
The second thing that was wrong was that I was iterating at all. Lisps
want you to think in terms of lists not their members, and I wasn’t.
What I was actually wanted to do was to call the payload function on each line
of text, passing the output of each invocation to the input of the next. And
of course Clojure, like any decent functional language, has a nice
reduce
function. So let’s just add a line to that fragment:
(let [chunks (. #"\n" split text)
; ... turn "chunks" into "lines" by stripping leading/trailing fragments ...
accumulator (reduce per-line nil lines)
Lispers are now nodding their heads in a despairing
of-course-the-moron-should-have-done-it-that-way way. Me, I like it when a profiler
shows me where I’m being stupid. Some non-Lispers are probably thinking
“That’s slick”. Most modern languages, not just Lisps, have some sort of a
reduce
equivalent.
I’m a bit amused here: this code has (basically) a map/reduce architecture,
except for I’m using reduce
in the map phase.
Lesson: Operate on lists instead of iterating.
Type Hinting ·
The code was now running visibly faster; time for another whack at the
profiler. I saw a whole lot of time being sucked up by
java.lang.Class.forName0
. This suggested to me that the Clojure
runtime was spending a bunch of time trying to figure out
what type of thing was crossing some method-dispatch barrier. I already had a
wrapper around the user-provided payload function because it has three
arguments and reduce
wants just two, and to protect myself from
Java
issue 4513622.
I seemed to remember from
somewhere that Clojure has “type hints”, so I looked that up and decorated the
wrapper:
(let per-line (fn [accum #^String line] (destination (new String line) user-data accum))
See that #^String
goober? It’s a hint telling Clojure
to assume that the incoming line
argument is a Java string.
It helped too, quite a bit. I was sort of irritated at having to hold
the system’s hand like this. But I think this may be a Release-1.0 symptom
rather than a Clojure design flaw. It feels to me like there’s quite a bit
more scope for type inferencing, maybe even JIT inferencing at run-time.
Lesson: A little static typing can go a long way.
Reading vs. Mapping ·
My first cut at this problem used Java’s NIO subsystem; I’d noted that
java.nio.channels.FileChannel
advertises itself as being
thread-safe and furthermore had a map()
method for mapping stuff
into memory. So I had this pretty slick (I thought) setup where multiple
threads would just get dealt out an offset and block size and map successive
regions of the file, then do the string-i-fying and splitting.
I was twiddling buffer sizes and thread counts and not observing any real
helpful patterns and wondered if I was maybe overthinking the problem. So I
made another version that uses a Clojure agent
to synchronize
sucking in the data with good old-fashioned
java.io.FileInputStream.read()
and fires off a new thread for
each block. Right off the bat, it was faster.
Experimenting with this one seemed to suggest that it ran faster and faster
the smaller I made the buffers. I’d started with 64M and eventually went all
the way down to 2M; 1M wasn’t any faster. I hypothesize that with the big
buffers, the big String.split
operation illustrated above was
creating too many short-lived transient objects that were overloading the
garbage collector.
Lesson: Disk really is the new tape.
Stabilizing the JVM · Early on, when I was fighting a memory leak, Rich Hickey suggested using a smaller heap size. I did that, and was astounded.
I was now seeing throughput on the order of 50-60M/sec. But as I watched
the program run, I could see that as the program built up its heap to the
ceiling (set via java’s -Xmx
argument), it was running much
slower, as slow as a tenth of that speed. When it hit the ceiling, absent
other pathologies, it would ramp up to production speed, I assume as the GC
got into balance with the rest of the program’s work.
Since it can easily take minutes for the JVM to expand to fill a 10G-or-bigger heap, you really can shave a couple of minutes off your run time by giving it a whole lot less. Yes, this is profoundly counter-intuitive.
Lesson: Measure, don’t guess.
Happy Dance! · At this point, I was starting to see run-times down well below fifteen minutes, and profiler output that looked like this (the primitive -Xprof facility gives you a little dump every time a thread exits, which turns out to work really well for me as I’m running through hundreds of ’em):
Flat profile of 29.63 secs (259 total ticks): Thread-58
Interpreted + native Method
0.4% 1 + 0 java.util.Arrays.copyOf
0.4% 1 + 0 sun.nio.cs.UTF_8.updatePositions
0.4% 1 + 0 java.lang.StringCoding$StringDecoder.decode
1.2% 3 + 0 Total interpreted
Compiled + native Method
57.1% 148 + 0 java.util.regex.Matcher.search
17.0% 1 + 43 java.util.regex.Pattern.split
8.1% 0 + 21 clojure.lang.APersistentVector$Seq.reduce
5.8% 3 + 12 java.util.regex.Pattern.matcher
3.1% 0 + 8 clojure.lang.ArraySeq.next
2.7% 0 + 7 clojure.core$conj__3121.invoke
1.2% 1 + 2 clojure.lang.PersistentHashMap$FullNode.assoc
0.8% 2 + 0 clojure.core$next__3117.invoke
0.4% 0 + 1 clojure.core$re_find__4554.invoke
0.4% 1 + 0 org.tbray.paralines$record__54.invoke
0.4% 0 + 1 clojure.core$assoc__3148.invoke
0.4% 0 + 1 clojure.lang.PersistentHashMap$BitmapIndexedNode.assoc
97.3% 156 + 96 Total compiled
Stub + native Method
0.8% 0 + 2 java.lang.System.arraycopy
0.8% 0 + 2 java.lang.Object.getClass
1.5% 0 + 4 Total stub
In other words, the program is spending most of its time using Java’s regex machinery to grind away at all that text input; which is I think what you’d like to see.
This was around 10PM last Friday, and I was crowing exultantly on Twitter, feeling like a rockin’ sockin’ functional-programming’ concurrin’ homoiconic wizard. “Faster than Scala” I uttered, incautiously.
Lesson: Read some Classics in transition from the archaic Greek and learn to fear the punishment for hubris, especially public hubris.
Which Wide Finder Was That? · Because, you see, I was actually running a Wide Finder 1 program, just computing one little statistic, whereas Ray Waldin’s Scala code, and the others on the results page, were working on the benchmark from Wide Finder 2. It’s a lot more complex and computes five different statistics which require looking at a much higher proportion of the data.
Well, I thought, How hard can it be?
Lesson: [Never mind. We’ve all been there.]
Attitude Problem · I now have one, because it seemed like a lot more work than it should be to code this up.
The stats you want to compute could be called “hits”, “bytes”, “referers”,
“404s”, and “fetchers”. So I accumulated results in a map whose keys were
:hits
, :bytes
, :referers
,
:404s
, and :fetchers
; the values were maps whose
keys were the URIs and values the in-progress statistics. So
if you want to add increment the :hits
value for
some URI, you say:
(bump accumulator :hits uri 1)
Here’s the bump
function:
(defn bump [accum stat uri increment]
(let [totals (accum stat)
currently (if-let [total (totals uri)] total 0)]
(assoc accum stat (assoc totals uri (+ currently increment)))))
In Ruby it’d all be a one-liner...
accumulator[stat][uri] += increment
...but then I’d be promiscuously mutating state and condemning myself to single-threadedness. Which is what this series is all about trying to avoid. On the other hand, it says exactly what I want to say without any ceremony or extra arm-waving. Is there a middle way?
Anyhow, here’s my gripe: I started sketching in the code to gather the five Wide Finder 2 stats Sunday while I was watching football. I spent all day Monday, until late in the evening, getting it all to work; admittedly, I’m a relative Clojure newb, but I’m not that terrible a programmer and I think I’m actually fairly quick at learning new things. This is way too long.
The result-merging step in particular gave me heartburn. If you can merge
two maps with Clojure’s elegant merge-with
and you can turn a
list-or-vector into another list-or-vector with map
, why can’t
you run a map
on a map
and get a
map
?!?
Now, the situation wasn’t helped by the fact that Clojure is after all 1.0, so its tracebacks are not models of helpful transparency; also, the code is running in dozens of threads in parallel which made for extra work in tracking down where what provoked that NullPointerException or ClassCastFailure.
Lesson: Some combination of faults in Clojure and faults in your humble correspondent is interacting badly.
Performance ·
Improved, but not terribly satisfying.
The wrong collection of JVM options can drive this sucker
into Garbage-Collection hell. This is painfully obvious when it’s going on;
suppose you’re using a simple monitoring tool like top(1)
while you monitor the output, you
can see it trundling along processing 30 to 50 MB/sec and reporting a CPU
burn rate anywhere between 1200% and 3000%, which means you’re keeping all
the cores and threads pretty darn busy. Then all of a sudden the CPU drops
to around 100% and the output just stops. For like 90 seconds or more. If
you’ve picked a bad combination of options, this happens more and more
frequently until you’re spending way more time garbage-collecting than
actually computing.
[Update:] Someone who wishes to remain anonymous read the first version of this piece had a discussion with me about JVM options and suggested simply taking all of them out, except for -Xms and -Xmx to ensure there’s enough heap; letting it manage its own GC strategy.
This person turned out to be smarter than the tribal lore I’d picked up over the years and around the net, and I’ve managed to stay out of GC Hell for the last couple of days.
On the other hand, after all the profiling and adjusting, my code has grown tons of extrusions and decorations; here’s the bit that actually does the work of processing a line of logfile:
1 (defn proc-line [line so-far accum-1]
2 (if (nil? line)
3 (send so-far merger accum-1)
4 (let [accum-2 (if accum-1 accum-1 (new-accum))
5 fields (. #" " split line)
6 ; [client _ _ _ _ _ uri _ bstring status ref] (. #" " split line)
7 #^String uri-1 (aget fields 6) #^String uri (. uri-1 intern)
8 #^String status (aget fields 8)
9 accum-3 (if (= status "404") (bump accum-2 :404s uri 1) accum-2)
10 #^String bstring (aget fields 9)
11 accum-4 (if (re-find #"^\d+$" bstring)
12 (bump accum-3 :bytes uri (new Integer bstring))
13 accum-3)]
14 (if (re-find re uri)
15 (let [accum-5 (bump accum-4 :hits uri 1)
16 #^String ref (aget fields 10)
17 accum-6 (if (or (= ref "\"-\"") (re-find #"tbray.org/" ref))
18 accum-5 (bump accum-5 :referers ref 1))
19 #^String client (aget fields 0)
20 accum-7 (bump accum-6 :fetchers client 1) ]
21 accum-7)
22 accum-4))))
Boy, that ain’t pretty. At one point, I was pulling out the fields with
fragments like (get fields 6)
but that led to some
XxxArrayAccessor method bubbling to the top of the profile output.
I attempted to replace that with “destructuring”, as in the commented-out Line
6, but that turns out to be just a macro that apparently generates about the
same calls.
I settled on aget
, apparently specialized for fishing around
in Java arrays. But you still have to type-hint what comes out of it.
Feaugh.
In Line 7, you can protect yourself from our old friend 4513622 by either
calling new String
on the uri, or by interning it, as above.
The former causes more memory stress but, if you can spare the memory, runs
faster.
Lesson 1: I’m not sure there’s a clear one here, but it makes me nervous when application code that should be simple is hard to get right, and so far Clojure’s not making this developer happy the way Matz designed Ruby to.
Lesson 2: The Concurrency Tax is too high.
Fallback hypothesis: I’m just Doing It Wrong.
GC Explosion · I accidentally ran an experiment that showed me where a lot of my GC pain was coming from. Reminder: this works by having one Clojure agent read blocks sequentially and handing them off to threads to process, and build those tables of tables with the results. When each thread is done, it sends its table off to another agent to merge.
One time, I had a bug that turned the merger into a no-op, the agent was invoked all right but simply ignored the incoming table. This version ran very fast; not as fast as the simple Wide Finder 1 code I’d been working with earlier, but not bad at all, with a lot less GC overhead. So that code above may be ugly but it actually runs OK.
Here’s the code that does the merge:
(defn merger [current incoming]
(loop [keys (keys current) output {}]
(if-let [key (first keys)]
(let [merge-1 (merge-with + (current key) (incoming key))]
(recur (rest keys) (assoc output key merge-1)))
output)))
It’s iterating all right, but only over the keys of the top-level map, of
which, remember,
there are only 5: :hits
, :bytes
, and so on. Not
much to it, and reasonably idiomatic I think. But boy does it ever generate
garbgae.
I note that in Ray Waldin’s
nicely-performant Wide Finder 2 code he relied on a
java.util.concurrent.ConcurrentHashMap
. I suppose it’s
unsurprising that that would outperform the sort of pure functional/dynamic
code above, but it feels a bit like cheating.
Java Versions and Options · My best times for the full WF 2 run are now around 30 minutes. 29:38 with JDK7, 35:10 with Java 6. The results of tinkering with GC options and generation sizes and so on seem insignificant-to-damaging. In particular then new “G1” garbage collector in JDK7 doesn’t seem appropriate for this problem.
What I Could Do Next ·
Use a ConcurrentHashMap like Ray did.
Bring VisualVM to bear on the problem and really do a deep-dive on what’s happening inside this code.
Try refactoring the app some more.
I dunno. I still like Clojure and stand by my 11 Theses, but that impedence mismatch with my conventional procedural object-oriented programer’s mind grows fatiguing.
Comment feed for ongoing:
From: Erik Engbrecht (Dec 09 2009, at 21:09)
One of the neat things Ray's solution does (and, indeed, I think most of the top performers) is not bother with line splitting. Once you're dealing with fairly optimized code, the performance is really driven by how many times you have to pass over the data. Java is automatically penalized in this case because it does one pass in order to turn bytes into strings, where languages that natively work with ASCII can just copy the bytes in. Being able to use the bytes directly rather than having a big heavy immutable string object probably helps a lot too (there's a another pass - copying the file buffer memory into the string buffer memory).
Anyway, I suggest looking through your code and seeing how many times you'll end up scanning the bytes/chars during your processing. Squishing that down will squish your time.
[link]
From: katre (Dec 09 2009, at 21:09)
I don't know from lisp, but if Clojure runs on the JVM does it make any sense to be using a dedicated Stats class instead of this map or maps of maps thing?
I'd probably invert your model and do a map of URI to Stats, where Stats then has five members for count, referers, etc.
I am hugely interested in all of these Clojure articles, but all they've convinced me of is "If I want to go functional, use more Scala".
[link]
From: Chouser (Dec 09 2009, at 21:09)
Using java.util.concurrent.ConcurrentHashMap is *not* cheating. There are many problems that can't be solved with a ConcurrentHashMap, and the more powerful constructs of refs, maps, transactions, etc. are required. But if they're not, by all means use the more efficient features of java.util.concurrent.
Also, here's an alternate implementation of bump. I don't know if its any faster or produces less memory churn, but at least it's a bit simpler:
(defn bump2 [accum stat uri increment]
(update-in accum [stat uri] #(+ (or % 0) increment)))
--Chouser
[link]
From: Nick L (Dec 09 2009, at 21:09)
Won't that intern cause a memory leak? (Assuming it works the same way as the java intern does)
[link]
From: Gary W. Johnson (Dec 09 2009, at 21:09)
Here are some substantially more idiomatic clojure representations of your functions. I'm not sure how they compare performance-wise, but they're certainly nicer to read IMHO.
[link]
From: Tim Bray (Dec 09 2009, at 21:11)
Testing, testing. Fixing breakage in the comment system.
[link]
From: Tim Bray (Dec 09 2009, at 22:40)
Katre: So thousands of little 5-element hashes would be better than 5 big hashes? Not obvious either way.
Nick: Could you expand? I would expect the opposite, and in fact the memory growth was more controlled.
[link]
From: David Nolen (Dec 09 2009, at 23:12)
Perhaps you should post the entirety of your code now that you've banged on it a little while :) I'm sure the Clojure community would be able to give you not only optimizations but perhaps other ways of tackling the problem.
One issue that you're probably running into is in the creation of many ephemeral objects. I wonder if this could be avoided by using atoms to hold the counts for bumping and not using nested maps?
While Clojure certainly encourages a functional style, Rich has introduced enough constructs where you can code in a concurrency safe yet some what imperative style if that is more natural.
[link]
From: Simon Eubanks (Dec 09 2009, at 23:27)
I'd like to tackle this problem, but I don't want to reverse engineer the problem definition from the code.
Is a definition of the Widefinder problem available? Sample data with expected results will also be nice to have.
[link]
From: Miron Brezuleanu (Dec 10 2009, at 00:49)
About the 'Attitude Problem' section gripe (assoc/assoc): Maybe assoc-in helps getting closer to the desired Ruby syntax?.
[link]
From: Patrick Wright (Dec 10 2009, at 00:51)
I think it would be better for your readers and the Clojure community if you posted your code somewhere, e.g. GitHub, for them to take a look at and study. Posting benchmarks with incomplete code blocks makes it hard for us to analyze where the problems might be, and your blogged results will remain in the blogosphere for eons, giving people the impression that perhaps Clojure is fundamentally broken or hard to work with and tune. I understand if you feel your code is a work in progress, but hey, that doesn't stop anyone else from posting their stuff.
Thanks for the detailed blogging, keep it up.
Patrick
[link]
From: Luke Gorrie (Dec 10 2009, at 00:54)
I think language implementers should think seriously about using disjoint heaps for each thread and copying in between. This is just such an easy way to give programmers a predictable performance model and avoid these voodoo frustrations.
I've been to a few research talks about Java GC and each time my take-away has been "wow, they made the problem much harder than the Erlang guys."
[link]
From: Perry Trolard (Dec 10 2009, at 08:00)
> why can’t you run a map on a map and get a map?!?
Here's the idiom, offered in the spirit of "let me share" not "why didn't you know":
(into {} (map (fn [[k v]] (do-something k v)) my-hash-map)
The (into {} ...) is the trick. The map function works on sequences, & will coerce anything it can into a sequence, which in the case of (hash) maps is a sequence of [key value] pairs. The (fn ...) operates on these pairs, returning another, modified pair for each, so the return value of the map function is the sequence
( [k v] [k v] [k v] [k v] ... )
So you've processed the original map as a sequence (here's the unifying sequence abstraction) and want to turn the thing back into a map: enter "into", which successively adds (with "conj") items from a sequence into a supplied datastructure. And
(conj {} [:key "value"]) => {:key "value}
(into, BTW, features the recently added transient optimization [http://clojure.org/transients])
I've really enjoyed your series on Clojure. I'm a longtime ongoing reader & a sometime Clojure enthusiast, so I'm happy for the intersection.
[link]
From: Brenton Ashworth (Dec 10 2009, at 13:44)
I love this project and can't wait to see what your final conclusions are.
Even though I don't completely understand everything that's going on in the proc-line function, I have a suggestion for how to clean it up. In your post "Idiomatic Clojure" you had a version of the batch-pmap-wide-finder code that you liked with functions tally, count-lines and find-widely. I am doing some log processing (not Wide Finder 2) and used this as a template to get started. Here are the good parts:
(def access-log-re #"[\"]GET\s(.*)\sHTTP/\d[.]\d[\"]\s(\d\d\d)\s(\d*)")
(defn tally [line]
(if-let [[_ hit code bytes] (re-find access-log-re line)]
{:lines 1
:total 1
(keyword code) 1
:hits {hit 1}}
{:lines 1}))
(defn count-lines
[lines]
(apply deep-merge-with + (map tally lines)))
(defn log-file-lines [filename]
(line-seq (reader filename)))
(defn process-log-file [filename]
(count-lines (log-file-lines filename)))
(defn log-files [dir]
(map #(.getAbsolutePath %)
(filter #(.startsWith (.getName %) *file-prefix*)
(file-seq (java.io.File. dir)))))
(defn process [project]
(apply deep-merge-with +
(map process-log-file
(log-files (str *root-dir* project "/server/")))))
In tally I use the regular expression to both find what I am looking for and also do a destructuring bind to extract the fields. I then create a nested map that has the stats for a single line.
In count-lines and process I use deep-merge-with from clojure.contrib.map-utils to merge the maps into a single map. My process function is merging results from multiple files but you can do the same type of thing in merging the results from multiple threads.
I'm not sure how this would effect performance.
[link]
From: David Terei (Dec 10 2009, at 17:41)
Would someone be able to point me in the direction of the sample data set? I would like to give Wide Finder a try.
[link]
From: Jim Robinson (Dec 11 2009, at 16:08)
I'd be very curious whether or not
pinning the memory size using both
the -Xms and -Xmx parameters set to
the same large value affects the
performance. You mention the cost of
repeated reallocs to grow the heap,
and so what happens if it just needs
to perform one large malloc?
Jim Robinson
[link]