This is the seventh progress report from the Wide Finder Project, in which I report on the work of others and turn this into kind of a contest. I’ll be back later with one more whack at Erlang, and then probably move on to other Wide Finder options. [Update: If you’re working on this but I missed you, sorry; holler at me and I’ll make sure you get covered.]
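For anyone coming in cold, the whole exercise revolves around a task that fits in a handful of lines of Ruby: scan this blog’s access log and report the ten most-fetched articles. From memory, the little program was more or less this, so treat the details as approximate:

  # Approximate reconstruction of the original little Ruby program:
  # count fetches of ongoing articles, then print the top ten.
  counts = Hash.new(0)
  ARGF.each_line do |line|
    if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
      counts[$1] += 1
    end
  end
  counts.keys.sort_by { |k| -counts[k] }.first(10).each do |k|
    puts "#{counts[k]}: #{k}"
  end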
Pi Calculus · Since I started this thing out with a little Ruby program, let’s start this report with work from Assaf Arkin. In Pi-cture this: Pi-calculus, Ruby and WideFinder, he introduces the π-calculus and a Ruby library that implements it.
He follows up with WideFinder, Ruby Pic, and Scaling Up, Out and Away, running the code and actually measuring the results.
Lisp · Well, it was only a matter of time. See wide finder in lisp by Irate Nate. I note that the Lisp code, while not as compact as the Ruby, is a lot more compact than the Erlang.
C++ · The Machine Elf writes Wide finder, parallelism and languages; his C++ implementation is actually shorter than Steve Vinoski’s first cut in Erlang, which I find amusing.
He runs his test on ancient creaking SPARC boxes, and brings MPI into play, which seems reasonable enough. He struggles with the fact that the Wide Finder can be (but need not be, I argue) I/O-bound. Still, I recommend reading this piece, in particular the closing Conclusion paragraph, which includes some first-rate lateral thinking about the specifics of this problem.
Parrot · At Linux DevCenter, Kevin Farnham writes Open Source Thoughts: Parrot and Multicore, which doesn’t actually reference Wide Finder at all, but is clearly thinking about the same space. He proposes that the Parrot VM (specifically aimed at being the substrate for Perl 6, but with loftier dreams) might be a good basis for progress in the multi-core world.
Python · Santiago Gala contributes Python, Erlang, Map/Reduce. His Python code is one line shorter than my Ruby version and, he claims, more elegant. Check it out and make up your own mind. He has no performance data yet.
Erlang · Over at Caoyuan’s blog, there’s Tim Bray's Erlang Exercise on Large Dataset Processing, which provides a huge glob of Erlang code and better numbers than I’ve got so far. Also he points to a contribution from Per Gustafsson with even better numbers.
But for my money, the most interesting Wide Finder Erlang work is being done by Steve Vinoski. His second report from the coal-face is More File Processing with Erlang. It’s long and dense and meaty and worth reading.
The Contest · They tell me I’ll soon be able to get my hands on a pre-release T2-based server. I will make a serious effort to run as many of these Wide Finders as practicable on that machine with lots of gigabytes of real data, and will report on results. Bear in mind that the T2 has a top cycle speed of only 1.4GHz, plus remarkable I/O and memory subsystems, plus 8 cores with 64 hardware threads. My expectation is that the Wide Finder won’t be I/O-limited on this box, and all these parallelism stratagems people have been grinding away at will bear fruit.
But there’s only one way to find out.
And it occurs to me that anyone who can make this puppy fly on the T2 will have done Sun a real favor, and deserves a big thank-you; I’ll see what I can arrange.
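To make “parallelism stratagems” a bit more concrete, here is the general shape most of these entries take, sketched in Ruby since that’s where this started. It’s nobody’s actual entry: carve the file into byte ranges, fork a worker per range, count matches in each slice, and merge the counts at the end.

  # A sketch, not anybody's actual entry: one forked worker per byte range
  # of the log file, per-worker counts merged in the parent over pipes.
  WORKERS = 8                      # a guess; tune to the core count
  PAT = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

  path  = ARGV[0]
  size  = File.size(path)
  chunk = size / WORKERS

  pipes = (0...WORKERS).map do |i|
    start = i * chunk
    stop  = (i == WORKERS - 1) ? size : (i + 1) * chunk
    rd, wr = IO.pipe
    fork do
      rd.close
      counts = Hash.new(0)
      File.open(path) do |f|
        if start > 0
          f.seek(start - 1)
          f.gets                   # skip the line straddling the boundary
        end
        while f.pos < stop && (line = f.gets)
          counts[$1] += 1 if line =~ PAT
        end
      end
      counts.each { |k, v| wr.puts "#{v} #{k}" }
      wr.close
      exit!
    end
    wr.close
    rd
  end

  totals = Hash.new(0)
  pipes.each do |rd|
    rd.each_line do |l|
      n, key = l.chomp.split(' ', 2)
      totals[key] += n.to_i
    end
  end
  Process.waitall

  totals.keys.sort_by { |k| -totals[k] }.first(10).each do |k|
    puts "#{totals[k]}: #{k}"
  end

The interesting tuning is all in the worker count, the chunk boundaries, and how the partial results come back; that is exactly where the real entries differ.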
From: Steve Vinoski (Oct 01 2007, at 19:27)
Thanks for the compliment, Tim. I guess this means more sleepless nights as I try to improve the 9.8 seconds on the full-size dataset. Gee, thanks! ;-)
I look forward to learning the results of running my tbray5.erl approach on that T2. That version definitely makes significant use of multiple cores. However, it will likely take some tweaking to figure out how many processes to launch and how much to read from the file at a time.
A slight correction on your posting, BTW: I still use C++ quite a bit and so am happy to see a C++ entry, but The Machine Elf's first C++ program is actually larger than my first Erlang program, at 48 vs. 31 non-blank lines, respectively. Also, the second C++ program is only slightly smaller than my second Erlang approach, which was 61 non-blank non-comment lines if you take out the scaffolding I left in for timing and testing. Considering that neither of those C++ programs performs any concurrent computing whatsoever, I don't think the comparison is useful -- or amusing. ;-)
From: CB (Oct 02 2007, at 03:52)
I've put a few Java versions together on a blog: http://unintentionalobjectretention.blogspot.com/2007/10/widefinder-and-java.html
From: Tim (but not THE Tim) (Oct 02 2007, at 11:52)
I understand the usefulness of parallelization of the search code, but I am having a hard time getting my head around how the I/O to the files will be parallelized. I definitely believe that you can search a file in chunks - assuming you're using a fixed-block architecture, you can divide the file into ranges for example - but how will you deal with rotational delays, head movement, and all of that? Will this work better on some parallel-striped version of the sequential log dataset and will your program(s) have to understand the I/O at that level?
From: Colin (Oct 02 2007, at 15:38)
Analysing the sample data showed that approx 38% of the entries are for ongoing.atom and 16% for ongoing.rss. Nothing else was much over 3%.
If pattern matching is not particularly fast in Erlang then hard-coding tests for those two entries prior to attempting to pattern match may speed things up a bit.
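The shape of that pre-filter, sketched in Ruby rather than Erlang just for illustration (the feed-name substrings are assumed from the percentages above):

  # Sketch of the pre-filter idea: reject the two most common (and
  # presumably never-matching) request lines with cheap substring tests
  # before paying for the full regular-expression match. The
  # 'ongoing.atom' and 'ongoing.rss' substrings are assumptions about
  # what those log entries look like.
  PAT = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
  counts = Hash.new(0)
  ARGF.each_line do |line|
    next if line.include?('ongoing.atom') || line.include?('ongoing.rss')
    counts[$1] += 1 if line =~ PAT
  end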
From: Thomas (Oct 02 2007, at 16:56)
I don't know it well enough to give it a go, yet, but I suspect a Scala implementation could be made fairly fast and scalable without lacking in elegance.
From: Pete Kirkham (Oct 03 2007, at 13:21)
The C++ code isn't intended to be concurrent, nor as terse as the Ruby - it was posted under the heading 'Measuring the problem' because it's an experiment to see where the load on the system is when doing the string matching and IO parts of the problem, and so perhaps suggest what should be done to mitigate that load; before you work out how to add concurrency, first you need to ask the question of whether adding extra CPU concurrency would even affect the total speed of the result, and if it would, where in the problem to add it.
I've tried to extrapolate from the spike test to what would happen on a system with 64 1.4GHz Sparc64 processors with an IO data rate equivalent to 'Thumper' at http://www.tincancamera.com/blog/2007/10/some-fingers-in-air.html
Pete
From: eric (Oct 04 2007, at 21:23)
Not to derail the conversation any more, but I've been testing the sort of system where the wide finder would be useful (db machine, 2x quad xeon, many fast sas drives) and I'm struck by the disparity in IO capability vs single core performance. Particularly, tar -xvzf archive.tgz big_dir limps along on one core.
While some have questioned the utility of scanning logs quickly, I'm pretty sure that tar and gzip are a pretty common need.
From: Caoyuan Deng (Oct 05 2007, at 09:11)
Hi Tim,
I rewrote my solution to run the most expensive procedure, 'scan_lines', in parallel, and got a more scalable result. On my 2.0 GHz 2-core MacBook, it takes about 13.778 seconds with no smp and 7.787 seconds with smp enabled (for a 200M data file, with about 50,000 processes spawned), so the 2-core smp run is about 1.77 times as fast as the no-smp run. But I'm not sure how it will do on an 8-core computer.
The code is at:
http://blogtrader.net/page/dcaoyuan?entry=tim_bray_s_erlang_exercise1
From: evan wider (Oct 06 2007, at 15:42)
have you seen this? wow!
http://effbot.org/zone/wide-finder.htm
From: Ilmari Heikkinen (Oct 06 2007, at 16:33)
I've been hacking on some Wide Finder implementations in C, of all things (hey, just wanted to know how fast it CAN go); you can see my results at http://fhtr.blogspot.com/2007/10/wide-finder-c.html
In a nutshell: 60ms to read 200 megs from page cache, 380ms to read and grep it, while the Ruby version takes 1200ms. So I'd use several Ruby processes instead, with IO.popen for coordinating and merging results. Sure, the C version is 3x faster, but it took around a hundred times more effort to write. My C is horrid, though :>
And, Pete Kirkham's grepping code in C++ is still faster than my C version (350ms vs. 380ms, http://www.tincancamera.com/blog/2007/10/poitroai.html )
From: Chad Brewbaker (Oct 08 2007, at 22:16)
The RubyMPI paper was submitted, complete with wide finder code:
http://www.public.iastate.edu/~crb002/widefinder.rb
From: Joshua E Cook (Oct 09 2007, at 12:05)
In this Python script that I wrote, I borrowed a couple of tricks from others and wrapped everything up in a sequence of generators and iterators:
http://joshuacook.com/snippet/python/finder-1.728.txt
It uses only a single thread, but completes the analysis of a 1M-line log file in just over 1.7 seconds, measured with the shell time utility on my 2.0 GHz Core 2 Duo MacBook with 1 GB RAM.
One big improvement I was able to make over my own earlier tries was to give a buffer length with the call to the built-in open(). I hadn't seen that in the other Python implementations.