[This is part of the
Wide Finder 2
series.]
I have now done the first “official” run of the naive Ruby implementation
of the benchmark. There is some discussion of the code
here. The
benchmark is described, and the naive Ruby code provided,
here.
I’ve started a results page
here.
There are already ten eleven other people now with accounts on the Wide Finder
machine, and I know there’ve been results that are hugely better than this
first cut.
Read on for a couple of notes on this first run.
The combination of using the SPARC-optimized Ruby and eliminating the huge sort shaved over four hours off the implementation time, but the results are still horrible: 25:24:41 elapsed time, with zero use of parallelism. The only thing that’s nice is the size: 78 lines of Ruby.
Can you do better on the time without exploding the code size?
An amusing note on the code. I needed a method to select the ten highest values (and corresponding keys) out of the hash table without sorting all of them. The code isn’t rocket science, here’s the skeleton:
class Hash
def top(howmany)
...
end
end
# ... later on
@clients = {}
...
@clients[client] += 1
...
top_clients = @clients.top(10)
I was showing it to a non-Rubyist and he said “WTF is that class Hash
stuff?!” I explained and added, “Hmm, I guess I could have just said
def top(hash, howmany)
”.
My questioner rolled his eyes. He had a point I think; Ruby makes you do
some odd things.
Comment feed for ongoing:
From: David Adams (Jun 05 2008, at 17:48)
I think the approach of adding your method to the object has some merit, both for general readability--the object.method grammar is all over the place, why reverse it just because you wrote a custom method?
It's also an easy way to enforce type safety. If you define top in Hash there's no way you can call it on the wrong kind of object, and you don't have to remember (or guess, when reading) what order the arguments go in.
[link]
From: Scott Johnson (Jun 05 2008, at 18:34)
It seems that it would make things more readable to use a subclass of Hash, but I'm not a Rubyist either.
[link]
From: Simon Willison (Jun 06 2008, at 00:20)
As a Python programmer, the cultural tendency towards monkey-patching and modifying built-in classes is the single thing in Ruby that really makes me uncomfortable. Adding a method to Hash is basically adding a global variable, and it seems to me that it has horrible consequences for maintainability. What if you later include another library that has its own idea about what Hash.top should do? You'll introduce a mystery bug which could be pretty hard to track down.
At RailsConf 2006 there was a keynote session on exactly this problem - that monkey-patching in libraries was introducing real bugs in people's code. It's obviously not just an imaginary problem. Here's a brief write-up of that session: http://woss.name/2006/10/04/railsconf-europe-2006-keynote-jim-weirich/
The other language I use where this is commonplace is JavaScript, and it's interesting to note that most of the JS libraries (with the notable exception of Prototype and mooTools) deliberately avoid modifying JavaScript built-in types. It's no coincidence that Prototype and mooTools are tho only libraries which can't safely be used on the same page as each other.
[link]
From: malcontent (Jun 06 2008, at 01:30)
Ruby lets you but does not force you to change the base classes.
This allows you to type 2.days.from_now instead of Date.DateAdd(Date.now(), 2, Date.DAY) or whatever.
[link]
From: Mauricio Fernández (Jun 06 2008, at 03:34)
I think the #top method you wrote doesn't belong in Hash (if the code is meant to be reused) because it is not generic; the comparison logic is specific to this application:
keys.sort! {|k1, k2| diff = self[k2] - self[k1]; (diff != 0) ? diff : k1 <=> k2 }
It assumes that the values are numeric, and imposes an unusual (value, reverse key) order.
You would expect Hash#top to return the top elements by key, value, or [key, value], but never by value and reversed keys.
In practice, it would be a good idea to have #top sort according to [key, value] (i.e., placing #top in Enumerable, as a generalization of #max), and add Enumerable#top_by as well as Hash#top_by_key, Hash#top_by_value and Hash#top_by_pair (mirroring #each_key, #each_value an #each_pair).
PS: are you still giving away accounts? I have some code waiting to saturate the I/O bandwidth of that T2000... ;)
[link]
From: James Aylett (Jun 07 2008, at 06:27)
I seem to have an earlier copy of the ruby benchmark, which doesn't have Hash#top, but instead does it inline:
keys_by_count = hash.keys.sort_by{ |key| -hash[key] }[0 .. 9]
This is much nicer: it's shorter, (I think) more readable, and avoids the righteous ire of people who don't want you contaminating the Hash namespace :-)
Unfortunately it gives different results: the sorts stabilise differently. (Of course this may not be an issue with the data set we're working on, but it's an issue with the one I mocked up.)
Also, I don't think Hash#top is actually a truly naive implementation, since the sort is doing some sort of elite set processing to avoid having to sort the entire contents of the hash. Won't this result in significantly greater complexity when parallelising or distributing the processing than a simple sort would?
[link]
From: James Aylett (Jun 07 2008, at 10:40)
In fact, I think it's worse than that. It seems (unless I'm being utterly dense) that (for instance) ruby's Hash and python's dict will give back keys in a different order. Because the keys aren't in a consistent sort order, we're going to get differences in output unless everyone emulates ruby's behaviour, because the check to consider new elements for the candidate result set in Hash#top isn't using the same comparator as the sorting of that candidate set (the former just checks the frequency/total bytes, but the latter also checks the key name, meaning that stability of the former is dependent on the order of Hash#each).
Actually, is even the method of the elite set right? Once you've got something in there, you'll never get anything less than it - so if the first entry out of Hash#each has too high a frequency, you won't make 10 entries in the output. (Or I don't understand ruby well enough.)
[link]
From: Lennon (Jun 08 2008, at 15:35)
A more OO *and* namespace-purity-preserving means of scoping the #top method would be the following:
def @clients.top(count)
...
end
Singleton classes are often one of the last pieces of the Ruby metaprogramming canon to be grokked by new Ruby developers, but are also the answer to many of the recurring questions about monkeypatching vs. global method definitions.
Of course, they also look weird to folks who expect classes to be global and static, but we can't expect to make everyone happy all of the time, right?
[link]
From: Graeme Mathieson (Jun 25 2008, at 03:37)
@Lennon That's ... genius. I think. It matches perfectly with what I've understood to be the ethos Kent teaches in Test Driven Development by Example:
* You've got your red bar.
* To get to green, do the simplest thing that could work, which is probably just to implement the contents of #top inline.
* Refactor. In this case, you're thinking object responsibility and want to make that hash responsible for finding its top n. But the only place the code is currently needed is on that instance, so let's just define it on the instance (really keeping the behaviour close to the data, eh?).
So you get all the bonuses of writing your intentions ("@count.top 10" which is lovely and clear what you're trying to achieve) while not causing any unwanted pollution to Hash. The only problem is that you have to define #top somewhere, which is potentially going to interrupt the flow of your code... Hmm, maybe define it in a module, so at least it's just "@count.extend WF2::Topper"?
[link]