Last fall, I ran the Wide Finder Project. The results were interesting, but incomplete; it was a real shoestring operation. I think this line of work is interesting, so I’m restarting it. I’ve got a new computer and a new dataset, and anyone who’s interested can play.
The Story Thus Far · I’ll use this entry as a table-of-contents for the series as it grows.
The Problem · The problem is that lots of simple, basic data-processing operations (in my case, a simple Ruby script) run like crap on modern many-core processors. Since the whole world is heading in the slower/many-core direction, this is an unsatisfactory situation.
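For concreteness, the kind of script in question is a short Ruby program in the spirit of the Wide Finder 1 benchmark (this is a sketch, not necessarily the exact original): scan an Apache-style access log and report the ten most-fetched articles.

counts = Hash.new(0)

# Tally fetches of /ongoing/When/ articles from an Apache-style access log.
ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

# Print the ten most popular articles.
counts.sort_by { |_key, count| -count }.first(10).each do |key, count|
  puts "#{count}: #{key}"
end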
If you look at the results from last time, it’s obvious that there are solutions, but the ones we’ve seen so far impose an awful complexity cost on the programmer. The holy grail would be something that maximizes the ratio of per-core performance increase to programmer effort. My view: anything that requires more than twice as much source code to take advantage of many-core is highly suspect.
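To make the source-code-ratio concern concrete, here is a rough, hypothetical sketch of what a many-core version of such a script tends to look like: split the input file into byte ranges, fork one worker per range, and merge the partial counts at the end. The worker count, the chunking details, and the reuse of the regular expression from the serial sketch above are all invented for illustration, and even this naive approach more than doubles the size of the serial version.

NWORKERS = 8
path = ARGV[0] or abort "usage: wf2.rb LOGFILE"
size = File.size(path)
chunk = size / NWORKERS

readers = (0...NWORKERS).map do |i|
  rd, wr = IO.pipe
  fork do
    rd.close
    counts = Hash.new(0)
    limit = (i == NWORKERS - 1) ? size : (i + 1) * chunk
    File.open(path) do |f|
      f.seek(i * chunk)
      f.gets unless i.zero?   # skip the partial line at the start of the chunk
      while f.pos < limit && (line = f.gets)
        if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
          counts[$1] += 1
        end
      end
    end
    Marshal.dump(counts, wr)  # ship this worker's partial counts to the parent
    wr.close
    exit!(0)
  end
  wr.close
  rd
end

# Merge the partial counts and print the top ten.
totals = Hash.new(0)
readers.each do |rd|
  Marshal.load(rd).each { |key, n| totals[key] += n }
  rd.close
end
Process.waitall
totals.sort_by { |_key, n| -n }.first(10).each { |key, n| puts "#{n}: #{key}" }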
Last Time & This Time · There were a few problems last time. First, the disk wasn’t big enough and the sample data was too small (much smaller than the computer’s memory). Second, I could never get big Java programs to run properly on that system; something locked up and went weird. Finally, when I had trouble compiling other people’s code, I eventually ran out of patience and gave up. One consequence is that no C or C++ candidates ever ran successfully.
This time, we have sample data that’s larger than main memory, we have our own computer, and I’m willing to give anyone who’s seriously interested an account so they can log in and fine-tune their own code.
The Set-Up · This time, the computer is a T2000, with 8 cores and 32 threads. It’s actually slower (well, fewer cores) than the semi-pre-production T5120 I was working with last time, but it’s got over 250G of pretty fast disks, on which I’ve built a ZFS pool with a single filesystem.
I’ve also staged all the ongoing logfiles from February 2003 to April 2008 there, 45G of data in 218 million lines.
It’s Internet-facing, with ssh access and port 80 open (not that I’m running a Web server yet).
Want to Join In? · I’ve set up a WideFinder Project Wiki. The details of how to get started are there. For now, anyone with a wikis.sun.com account can write to it, which I’m hoping will be an adequate vandalism filter.
The First Step · Before we start coding, we need to agree on the Benchmark; the 13-line Ruby program was an instructive target for Wide Finder 1, but several people pointed out that it could be done with an awk/sort one-liner, so something a little more ambitious might be appropriate. I’ve made a couple of initial suggestions on the wiki.
Comment feed for ongoing:
From: Jonathan Ellis (May 01 2008, at 12:50)
Good luck coming up with a benchmark that will be CPU-bound rather than IO-bound.
[link]
From: Lenny (May 01 2008, at 12:50)
I'm not sure how to read the Bonnie report on the wiki, but I think this problem depends on the ability to parallelize I/O.
Can you describe the RAID setup? The server is probably striping or mirroring files across a few disks, which lets them be read in parallel. Is it possible for you to run the benchmark on the same CPU with the data only on 1 disk?
[link]
From: someone (May 01 2008, at 13:42)
The behavior of the original Ruby script suggests that any line in the log can be processed in any order, so the job is embarrassingly parallel in the way that raytracing and evolutionary-computation optimization heuristics are.
This means a MapReduce-like infrastructure with as many machines and disks as possible is ideal. Scale horizontally for increased performance.
It seems like you want someone to take conventional programming languages and paradigms and turn them into some kind of turbocharged data processing platform or that you're looking for some kind of free lunch where the epic slowness of disks and the frustrating lack of malleability in popular programming languages is somehow completely sidestepped. This from someone whose favorite languages are Java and Ruby, languages whose main ideas and idioms were well-established by the 1980s.
[link]
From: Robin (May 01 2008, at 13:50)
Folks can use the same hardware and verify results if you use AWS EC2 instances (1-8 threads) as the hardware.
[link]
From: Charles Oliver Nutter (May 01 2008, at 14:14)
I second the IO statement, but with enough memory the benchmark could be made to just read it all in. What's a few GB of data in-memory these days?
Otherwise a mathematical test would be interesting. I'm not sure the benchmark has to be necessarily practical to show the effect.
[link]
From: Erik Engbrecht (May 01 2008, at 16:38)
Couple comments:
(1) A more compute-bound benchmark would be a good idea; I second Charles Nutter's comment about perhaps a math one.
(2) There should be some Unicode files to parse. Environments that use Unicode (like Java) pay a big penalty when processing ASCII files compared to environments that use ASCII.
(3) I would be writing this on the wiki but it crashed after I created a screen name and now crashes every time I try to log in.
(4) There should be some testing with the system already under load to see if the programs are well-behaved.
[link]
From: Avi Bryant (May 01 2008, at 23:22)
I also think doing this on a Large EC2 instance would be a good idea - not only would this standardize the test hardware, but a given implementation could be packaged up as an AMI so that installation/compilation issues didn't come up.
[link]
From: Nick Johnson (May 02 2008, at 04:10)
Just to reiterate and reinforce what others have said: I think your scope is wrong. If you want to provide scope for novel solutions, your data needs to be spread over multiple machines, or at the very least, multiple disks (without raid getting in the way of reading efficiently from each disk in parallel).
[link]
From: Erik Engbrecht (May 02 2008, at 09:25)
Nick - I think the scope is pretty close to right. Tim is talking about a really common type of task here. You really need a more complex task than counting occurrences of a line matching a pattern to demonstrate really using lots of cores, but not that much more complex.
The challenge here is to make Moore's law continue to apply to the speedup of a simple task while keeping it simple. That's a much more interesting challenge than just making something as fast as it can be on a given hardware configuration.
Tim - I have another suggestion. See if you can find an old single processor or dual processor Sun box with cores about the same speed as the T2. These programs should scale - and scaling goes both up and down. A program that requires 8 cores and 32 hardware threads to be performant is a non-starter.
[link]
From: Erik Engbrecht (May 02 2008, at 12:31)
Side comment - there is no RSS feed for changes to the wiki - or at least if there is I can't find one. That's rather annoying.
[link]
From: Ed Borasky (May 03 2008, at 19:03)
Yeah ... not having an RSS feed on the wiki is annoying, since that's where I've been making my comments. I'll apply the DRY principle and not repeat myself.
[link]
From: Jeremy Dunck (May 05 2008, at 14:45)
Is it taken as a given that we're going to find, in general, significant gains in parallelism?
Check out Knuth's comments on parallelism here:
http://www.informit.com/articles/article.aspx?p=1193856
Virtualization and process-as-abstraction may help, but not everything is embarrassingly parallel.
[link]
From: Seth Ladd (May 06 2008, at 06:08)
How is this not an IO problem? Why can't we use Hadoop here?
[link]
From: Edward Berner (May 09 2008, at 23:32)
Nice. I look forward to watching this play out. Maybe I'll even scrape together an entry this time....
[link]
From: ykud (Jun 01 2008, at 10:54)
For a Solaris SMP, a map/reduce variant on Phoenix might be worth trying.
[link]
From: Tony Fisk (Jun 01 2008, at 18:57)
I don't have time for this, unfortunately.
A lot of clever folk have quite probably thought of all this already, but one suggestion I can make is that the task you have set cannot proceed faster than the 'bottlenecks' in it. The bottlenecks may or may not be amenable to parallelism (disk I/O being a good example). Assuming that they are, however, then it is a matter of load balancing: assigning a proportional number of threads to the most intensive steps so that data is processed at each step at about the same speed.
A useful trick would be to get all those threads to do the load balancing themselves. One way to achieve this is to define each step in the process as having an input queue, and one or more output queues (acting as an input to the next step). If you were to arrange those queues as a list sorted by the number of data entries to process, and this list was re-sorted each time an entry was removed or added, then threads could run the next task according to how many entries were currently awaiting processing, and so would automatically concentrate on the bottlenecks of the moment.
[link]
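For what it's worth, a minimal, hypothetical sketch of the self-balancing arrangement Tony describes might look like this in Ruby: each stage owns an input queue, and every worker repeatedly pulls from whichever queue currently has the longest backlog. The stage bodies below are placeholders, and on MRI the threads only illustrate the scheduling idea rather than real multi-core speedup.

# A stage is an input queue plus a unit of work that may feed a later stage.
Stage = Struct.new(:name, :queue, :work)

results = Queue.new
tally = Stage.new(:tally, Queue.new, ->(path) { results << path })
parse = Stage.new(:parse, Queue.new, ->(line) {
  field = line.split[6]          # request path in a common-log-format line
  tally.queue << field if field
})
stages = [parse, tally]

# Feed the first stage before the workers start.
File.foreach(ARGV[0]) { |line| parse.queue << line }

workers = Array.new(8) do
  Thread.new do
    until stages.all? { |s| s.queue.empty? }
      # Self-balancing: always work on the stage with the biggest backlog.
      busiest = stages.max_by { |s| s.queue.size }
      if (item = (busiest.queue.pop(true) rescue nil))
        busiest.work.call(item)
      end
    end
  end
end
workers.each(&:join)

puts "processed #{results.size} records"

Real stage bodies would do the pattern matching and counting; the point is only that threads gravitate to wherever the backlog currently is.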
From: Taylor (Jun 02 2008, at 09:09)
comments are closed on the item, but the airport pics are showing up today..
"shrine to Our Lady of Europe"?
With an EUC flag? I find it a bit startling to see religion, idols, and flags all mixed together like that.
[link]
From: Andrew Hobson (Jun 06 2008, at 14:19)
I'm not qualified to render analysis on the link and whether it applies here, but I found it interesting that LtU posted a paper on Map-Reduce-Merge at about the same time.
http://lambda-the-ultimate.org/node/2836
[link]