This is the fourteenth progress report from the Wide Finder Project. I still have a bunch of Wide Finders to run, and for the time being I’ll keep trying to run what people send me; we can’t have too much data about this problem. But some conclusions are starting to look unavoidable to me.

First off, you should stop reading this and visit Eigenclass to read Mauricio Fernandez’s Conclusions about Wide Finder, C++, OCaml, JoCaml, Erlang and friends. While I don’t buy his use of the word “interpreted”, I find it hard to argue with his conclusions about languages and implementations (and do read his comments). My conclusions are (I think) fairly language-independent, and are about technical strategies: what works, what doesn’t, and the cost-benefit trade-offs.

Bypassing the Filesystem · This is not a new finding. When I/O is on your critical path and you’re on a Unix-like system, you quite likely ought to be using mmap() rather than read() if you can. In many circumstances, the path through the OS’s VM-paging subsystem is quicker than the path through most filesystem implementations. Multics, an influence on most modern operating systems, had only memory-mapped I/O.
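To make the difference concrete, here’s a minimal Python sketch (the function names are mine, and treat it as illustrative, not as anybody’s submitted entry). Both count lines; the read() version copies every block into a userland buffer before scanning it, while the mmap() version scans the mapped pages in place.

    import mmap

    def count_lines_read(path, bufsize=1 << 20):
        # Conventional route: read() copies each block out of the
        # page cache into a userland buffer before we can scan it.
        n = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(bufsize)
                if not chunk:
                    break
                n += chunk.count(b"\n")
        return n

    def count_lines_mmap(path):
        # VM route: map the file and scan the kernel's pages in place,
        # with no explicit read() calls and no buffer copies.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                n = 0
                pos = mm.find(b"\n")
                while pos != -1:
                    n += 1
                    pos = mm.find(b"\n", pos + 1)
                return n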

The downside is that files are a very natural lens through which to look at persistent data. Every language has gets() or an equivalent for a good reason.

Also note that the advantages of memory-mapping are overwhelming when you’re doing input and your program’s semantics don’t require random update. It would be dangerous to extrapolate much past that scenario.

I think it’s OK for programmers to want to tell the system “Run this code on each line of this data.” If it’s more efficient to do that in a memory-mapped way, the infrastructure ought to just make it happen.
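Here’s a sketch of what that might look like, in Python with a hypothetical each_line(): the programmer gets ordinary line-at-a-time semantics, and the memory mapping is an implementation detail underneath.

    import mmap

    def each_line(path):
        # Line-at-a-time semantics on top of a memory mapping;
        # readline() just scans forward through the mapped pages.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                while True:
                    line = mm.readline()
                    if not line:
                        break
                    yield line

    # The caller just says "run this code on each line of this data":
    # for line in each_line("access_log"):
    #     process(line)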

Parallelize the I/O · While Wide Finder is not strictly “I/O-bound”, it’s certainly got a big I/O component. If you want good performance on parallel hardware, you need your I/O to be concurrent.

Combining memory-mapping and concurrency yielded fairly mind-boggling results. Check those Bonnie numbers: one core on this particular box maxes out at around 160MB/second of block input, so just straightforwardly block-reading the data would burn over 5.5 seconds; yet the JoCaml version ran in less than two.

So while this is obviously worth doing, the same issue arises: programmers like being able to say “Run this code on each line of this data”, and that’s reasonable. So what we want is a really cheap way for them to add “... and I don’t care what order you run ’em in, or if a bunch run at once”.
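Here’s one way that might look, sketched in Python with the standard multiprocessing module (the names and the eight-worker default are mine): split the file into byte ranges, snap each interior boundary forward to the next newline so no line straddles two workers, and let the pool process the chunks in whatever order it pleases.

    import mmap
    import os
    from multiprocessing import Pool

    def chunk_ranges(path, nchunks):
        # Split the file into roughly equal byte ranges, then push
        # each interior boundary forward past the next newline so
        # that no line straddles two chunks.
        size = os.path.getsize(path)
        bounds = [size * i // nchunks for i in range(nchunks + 1)]
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                for i in range(1, nchunks):
                    nl = mm.find(b"\n", bounds[i])
                    bounds[i] = size if nl == -1 else nl + 1
        return [(bounds[i], bounds[i + 1]) for i in range(nchunks)]

    def lines_in_range(args):
        # Each worker maps the file independently and scans only its
        # own byte range; the order the workers run in doesn't matter.
        path, start, end = args
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                n = 0
                pos = mm.find(b"\n", start, end)
                while pos != -1:
                    n += 1
                    pos = mm.find(b"\n", pos + 1, end)
                return n

    def parallel_line_count(path, nworkers=8):
        jobs = [(path, s, e) for s, e in chunk_ranges(path, nworkers)]
        with Pool(nworkers) as pool:
            return sum(pool.map(lines_in_range, jobs))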

Bypassing Regular Expressions · The Wide Finder programmers did all sorts of pattern-matching voodoo to dodge the overhead of regular expression processing. I don’t have a good quantitative handle on how much of the global speedup this bought, but I’m not convinced that it’s a good idea.

Regular expressions are incredibly compact and expressive, and in my opinion they’re a good tool that programmers should use to drive the logic of text processing. The articles in Beautiful Code (not just mine) make it obvious that thoughtful developers obsess over regexes, and there’s a reason for that. They also make the case, which I’ve heard repeatedly, that there’s plenty of scope for optimization in currently-popular regex engines.

I suspect we may have been pushed over the edge of this slippery slope by Erlang’s lousy default regex engine. In the end it comes down to a trade-off: if that fast JoCaml code could be simplified by switching to regexes at the cost of, say, a factor of two or three in performance, that might be worth it.
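For scale, here’s roughly what the plain-regex version looks like in Python; it has essentially the shape of the original Ruby Wide Finder, and the pattern shown is approximately the original one (treat it as illustrative).

    import re
    from collections import Counter

    # Approximately the original Wide Finder pattern: fetches of
    # ongoing article URIs, keyed by the date-plus-name fragment.
    ARTICLE = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

    def top_articles(path, n=10):
        counts = Counter()
        with open(path, "rb") as f:
            for line in f:
                m = ARTICLE.search(line)
                if m:
                    counts[m.group(1)] += 1
        return counts.most_common(n)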

Computation · The Wide Finder does require you to accumulate results per unique article; doing this in parallel is going to require some thought. It’s difficult in that there’s no real readers-versus-updaters distinction to exploit; the data structures are effectively write-only for most of the program’s run.

While this actually didn’t seem like a big deal to the Wide Finder implementors, the computation being done here is about as simple as it gets. So I think this is the tough end of the problem: what kind of general-purpose computation framework do you give developers doing this kind of work? And to be fair, that’s not a question Wide Finder does much to explore.
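For this simple counting case, at least, one plausible shape for such a framework is map-and-merge: each worker accumulates into its own private table, and because addition is commutative the tables can be combined in any order at the end, with no locking along the way. A Python sketch, with names I made up:

    import re
    from collections import Counter
    from multiprocessing import Pool

    ARTICLE = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

    def count_chunk(lines):
        # Each worker fills its own private table; no shared state,
        # so no locks, while the bulk of the work is going on.
        counts = Counter()
        for line in lines:
            m = ARTICLE.search(line)
            if m:
                counts[m.group(1)] += 1
        return counts

    def top_ten(chunks, nworkers=8):
        # Addition is commutative, so the per-worker tables can be
        # summed in any order once everybody has finished.
        with Pool(nworkers) as pool:
            total = sum(pool.map(count_chunk, chunks), Counter())
        return total.most_common(10)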

Up till a couple of weeks ago, I would probably have argued that the Erlang model was the best I’d seen for cracking that nut. Now I think that the join calculus, as exemplified in JoCaml, deserves a serious look.

Conclusion · I still think that the ideal-world solution looks something like this. But I’ve sure learned a lot about how we might get there from here.



Contributions


From: Joe Cheng [MSFT] (Nov 13 2007, at 14:52)

Maybe I missed it--did you mention whether your benchmarks were with a cold disk cache or hot?


From: Tim Bray (Nov 13 2007, at 15:09)

Joe, good point. I updated the WF-XI: Results piece to note that I always ran the programs several times in a row, often interspersed with other WF code, in an effort to warm up the caches. And I'm pretty well the only person using that machine, so results should be comparable.


From: Daniel (Nov 13 2007, at 15:48)

Looking at performance vs lines of code, the most prominent conclusion I have is that concurrency is still very hard, despite occasional claims that tool X or language Y makes it really easy.

Given that the intent of the competition was to explore whether and how multiple cores can be exploited for some specific task, I guess that's not good news at all: there seem to be no serious concurrency gains below 10x the code size, the winner is over 20x, and others are worse still.

So you can use all those cores, but be prepared to spend a lot of (developer's) time on it!


From: Fredrik (Nov 14 2007, at 00:13)

"If that fast JoCaml code could be simplified by going regex at a cost, say, of a factor of two or three in performance, that might be a good trade-off."

That would give you the wf-6.py program, right? (Which does use a plain RE and no other search tricks, not even a sublinear prefix check.) For those who think that wf-6.py is bloated, read this: http://mtrr.org/blog/?p=91


From: Jacek (Nov 14 2007, at 09:29)

I haven't seen this mentioned, and I haven't reviewed the various pieces of code to see if anyone does this, so if this is done, just ignore this... 8-)

In your "goal" description, you parallelize the each_line command, but all the synchronization could be avoided if all threads just counted their own subtotals, and when they finish, they can give it to a single thread, so far with no locking, to compute the total totals and select the top-ten results.

In other words, if you blindly parallelize each_line, you'll require locks for accessing the counts associative array; but because addition is commutative (i.e. a+b=b+a), we can keep a separate associative array in each thread, computing the totals at the end. I don't know if a compiler could do this optimization, and thus eliminate a lot of the locking.

Even if the submitted implementations do that, it should still be mentioned as a way to really make the code parallelizable.

Or it's just too trivial and everyone would naturally think about this...


From: Adrian Thurston (Nov 14 2007, at 12:54)

Jacek: the Ragel solution I wrote does this. It spawns workers and then merges the results afterwards. One thing I would like to try out, if I had a many-core machine, would be dividing and conquering the merging.

http://groups.google.com/group/ragel-users/browse_thread/thread/45c5b1448771df7


From: Mike Parsons (Nov 15 2007, at 04:42)

Hey Tim ... I've really enjoyed this series. Some great contributions and creative "hacks". I also find it interesting when compared against this ... http://www.langpop.com/

