[This is part of the Wide Finder 2 series.] A bunch of people have requested the sample data. Meanwhile, over at the wiki, there’s a decent discussion going on of what benchmark we should run.
I particularly liked the comments from MenTaLguY, Preston Bannister, and Erik Engbrecht. (Unfortunately, given the test data we have, I don’t see us being able to follow Erik’s “non-ASCII” suggestion).
Of the suggestions on offer at the benchmark page, I prefer the last couple. Session statistics are a little trickier, and I suspect a bit more resistant to a brute-force Map/Reduce approach. And the “normal HTML report” idea is very much in the spirit of Wide Finder 1, only with a bit more computation involved, perhaps enough to keep it from being a pure parallel-I/O benchmark.
Your thoughts? I suppose I’m signed up to build a reference implementation for verifying output.
Comment feed for ongoing:
From: Alastair Rankine (May 12 2008, at 16:39)
Tim, it seems to me that specifying the input data up-front unnecessarily limits the range of interesting benchmarks. In other words: why start with your logfile data? Why not tailor the input data to the benchmark, instead of the other way around?
For example, a Wikipedia snapshot could provide a large set of input data for UTF-8 based benchmarks.
[link]
From: Erik Engbrecht (May 12 2008, at 18:05)
IMHO you can do Unicode in the small. Think of it as "Does your implementation support Unicode, yes or no? If yes, prove it." I don't think you need to replicate all of the tests in all their enormous glory - just enough to demonstrate a capability. There are people who care about Unicode, there are people who don't, and there are people who just want to understand what they're getting and what they're losing.
[link]
From: Matt Brubeck (May 12 2008, at 19:22)
Speaking of HTML, a benchmark that actually involved fetching and parsing HTML pages from the web would be interesting. Compared to mere logfile parsing and analysis, it might actually be CPU-bound. And adding some remote requests would be a good test of concurrent/asynchronous connection management.
[link]
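As a rough illustration of the kind of fetch-and-parse workload Matt describes, here is a minimal Python sketch using only the standard library; the URL list and the "count the links" step are placeholders, not part of any agreed benchmark.

~~~python
# Sketch of a fetch-and-parse benchmark kernel: grab a batch of pages
# concurrently, then do some CPU work (here, just counting anchor tags).
# The URLs are placeholder inputs, not an agreed benchmark data set.
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ["http://example.com/", "http://example.org/"]  # placeholders

def fetch_and_parse(url):
    with urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Stand-in for real HTML parsing: count anchor tags.
    return url, len(re.findall(r"<a\s", html, re.IGNORECASE))

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, links in pool.map(fetch_and_parse, URLS):
        print(f"{url}: {links} links")
~~~

Even this toy version shows the shape of the problem: the interesting part is how many requests you keep in flight and how the per-page parsing work is scheduled, not the raw I/O.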
From: Daniel (May 13 2008, at 14:23)
From the wiki: "Generally it would be good to have one problem that is easily partitioned across processes/machines vs one that has interdependencies between lines e.g. navigation graph above."
Heh... is that an attempt at defining a problem to fit the solution? I personally think the graph problem would be much, much more interesting, and also more in the spirit of the project, given that: 1) they are usually easy to formulate sequentially, 2) large graphs tend to be computationally expensive and would thus actually benefit from multi-core hardware, 3) they should be relatively resistant to gaming the results or mere implementation tricks. (Or, actually, any trick that would speed up such a problem is more likely to also have real-world applicability, somewhere.)
From the article: "(Unfortunately, given the test data we have, I don’t see us being able to follow Erik’s “non-ASCII” suggestion)."
Uh, why not? All you have to do is stick an Umlaut or a Chinese character into the title of an upcoming article so that it shows up in the URI. Naturally, that would still mean the data is ASCII-dominated, but it would prevent contestants from simply ignoring Unicode altogether. Alternatively, you could simply fake a single entry, e.g. using the following, rather wonderful contraption: http://www.revfad.com/flip.html
If you want to be pedantic, you would also have to define the problem such that the given URI would somehow show up in the end result.
[link]
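For what it's worth, here is a small Python illustration of what Daniel's suggestion would look like in a logfile: a non-ASCII title percent-encodes into the URI, and a solution that ignores Unicode will mangle it on the way back out. The title and path below are invented for the example.

~~~python
# Illustration: how a non-ASCII article title would appear in a logged URI.
from urllib.parse import quote, unquote

title = "Überraschung-中文"  # invented example title
path = "/ongoing/When/200x/2008/05/13/" + quote(title)
print(path)           # .../%C3%9Cberraschung-%E4%B8%AD%E6%96%87
print(unquote(path))  # round-trips only if the decoder treats the bytes as UTF-8
~~~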
From: Ray Waldin (May 17 2008, at 11:23)
I like the normal HTML report idea, but maybe it could be made a little more interesting (and useful!) by allowing minor runtime customizations. This would leave the benchmark a little open-ended until the official runs, where a standard customization can be applied across the board. For example, in the original WideFinder the records were filtered and counted using a regex pattern matched against the request field. Maybe the pattern and field are left unspecified until the official run.
An extreme, but unrealistic, benchmark would be to invent a full-blown DSL for querying request logs that each solution would be required to implement. It might be easier for now to just require regex-based filtering and grouping of any of the log record fields, and have the output standardized to Top 10 URLs, Top 10 User Agents, etc.
[link]
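As a rough sketch of what Ray's "regex filter plus Top 10" shape might look like, here is a minimal Python version; the log-format regex, field names, and filter pattern are all assumptions for illustration, not the agreed benchmark definition.

~~~python
# Sketch: regex-filtered grouping over a combined-format access log,
# reporting the Top 10 values of one field. The format and patterns
# are illustrative assumptions, not an official benchmark definition.
import re
import sys
from collections import Counter

# Crude combined-log-format matcher; real logs have messier edge cases.
LINE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def top10(path, field="request", pattern=r"GET /ongoing/"):
    keep = re.compile(pattern)
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = LINE.match(line)
            if m and keep.search(m.group(field) or ""):
                counts[m.group(field)] += 1
    return counts.most_common(10)

if __name__ == "__main__":
    for value, n in top10(sys.argv[1]):
        print(n, value)
~~~

Leaving the field and pattern as parameters is exactly the kind of runtime customization Ray suggests: the harness stays fixed, and the official run can swap in an unannounced filter.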