This is the fifth progress report from the Wide Finder Project; an aggregation of what other people have been saying.
Misguided? · The opinion continues to be heard, here and there, that the project is misguided; that a well-tuned regexp engine will run at I/O speed or better, thus the whole thing is I/O bound, and not suitable for speedup via parallelization.
The numbers I see coming out of the Ruby and Perl versions of the code lead me to disagree, so I’m going to keep poking around. And even if this point of view turns out to be right, that in itself would be a valuable finding: “Typical sequential logfile-processing tasks are not particularly helped by the use of many-core technologies.”
Other Approaches · Let’s see; Martin Probst has been working with Scala and Actors; see Scala Actors - no speedup? and Wide Finder in Scala.
Bryan O’Sullivan took a run at the problem with Haskell, written up in What the heck is a Wide Finder, anyway?
In my comments, Tom White made the obvious connection to MapReduce and Hadoop, and added some other interesting remarks.
Caoyuan Deng tried a different approach in Erlang, and then Per Gustafsson worked on it too.
See Also · Tom Preston-Werner is thinking about Calling Erlang from Ruby. Patrick Logan discovers Apparently Fast Erlang File Read and Regexp. The Semergence blog has I Second That Emotion, about Erlang file I/O.
Comment feed for ongoing:
From: CB (Sep 28 2007, at 05:04)
Java too uncool to mention? Here is the 2 thread version that gets about 20% speedup anyway:
public class WideFinder2ThreadsBulk {
public static void main(String[] args) throws IOException, InterruptedException {
int batchSize = Integer.parseInt(args[1]);
Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");
LineCounter counter = new LineCounter(p);
Thread t = new Thread(counter);
t.start();
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]), "US-ASCII"));
String line = null;
List<String> batch = new ArrayList<String>(batchSize);
while ((line = in.readLine()) != null) {
batch.add(line);
if (batch.size() == batchSize) {
counter.addLines(batch);
batch = new ArrayList<String>(batchSize);
}
}
counter.addLines(batch);
counter.finished();
t.join();
in.close();
List<Entry<String, Integer>> results = new ArrayList<Map.Entry<String, Integer>>(counter.getCounts().entrySet());
Collections.sort(results, new Comparator<Entry<String, Integer>>() {
public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2)
{
return o2.getValue().compareTo(o1.getValue());
}
});
for(int i = 0; i < 10; i++) {
System.out.println(results.get(i));
}
}
static class LineCounter implements Runnable {
private static final List<String> FINISHED = new ArrayList<String>();
private final BlockingQueue<List<String>> queue;
private final Pattern pattern;
private final Map<String, Integer> counts = new HashMap<String, Integer>();
LineCounter(Pattern pattern) {
this.queue = new ArrayBlockingQueue<List<String>>(100);;
this.pattern = pattern;
}
public void run() {
List<String> lines = null;
try {
while ((lines = queue.take()) != FINISHED) {
for (String line : lines) {
Matcher m = pattern.matcher(line);
if (m.find()) {
String key = m.group();
Integer currentCount = counts.get(key);
counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));
}
}
}
}
catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
void addLines(List<String> batch) throws InterruptedException {
queue.put(batch);
}
void finished() throws InterruptedException {
queue.put(FINISHED);
}
Map<String, Integer> getCounts() {
return counts;
}
}
}
IMO think of processes as free is pretty misguided. The Erlang scheduler isn't magic. And sending messages between processes is going to involve some locking somewhere inside the interpreter. If you want amazing scalability then "share nothing" is the only way to go, and message passing between processes is not "share nothing"
[link]
From: Chris Anderson (Sep 29 2007, at 11:50)
Assaf Arkin has an interesting lightweight ruby implementation designed for simplicity rather than speed. One could apply his approach in a more performant manner, if only Ruby's threads could jump cores.
[link]
From: Pete Kirkham (Sep 29 2007, at 14:24)
I would have thought that a language which implicitly parallelises constructs across the available cores, such as Fortress http://fortress.sunsource.net/ (or the various parallel Fortran, or C, or MPI implementations), would be better suited to this problem, rather than actor model implementations such as erlang, which is designed to allow the expression of massively concurrent systems on a limited number of cores without the cost of OS thread context switching.
[link]
From: Norbert Klamann (Sep 30 2007, at 23:03)
@CB : "Message passing is not share nothing". ?
I think it is. You need the lock if you use a piece of memeory from 2 or more processes. That is sharing. Message passing hands the massage and the reponsibility to another process, it sends a copy of all relvant data too.
If the amount of data is too big you use a database.
This is the Erlang POV as I understand it (but I am new to Erlang). Processes are meant in the Erlang sense : something like very light threads.
[link]
From: Bruno (Oct 01 2007, at 13:26)
If Java's too uncool to mention, here's my attempt using... (*drumroll*) grep processes and unix pipes.
(The following times are for 4 log files for a total of 116M or 547856 entries on a 2.33 GHz Intel Core 2 Duo MacBook Pro using standard software.)
I first tried Tim's Ruby version (with a modified regexp) and an almost identical Perl version. Ruby takes 2.19s, while Perl does it in 1.31s, so I guess this is the "approximately twice faster" that Tim observes.
On to some parallel versions. Let's split up the functionality in two functions: one to extract and sort requests and another one to rank them:
$ function sreqs { egrep -ho "GET [^ .]+ " "$@" | sort; }
$ function rank { uniq -c | sort -nr | head -n 10; }
The first attempt is to just connect these two:
$ sreqs access_log.* | rank
This already achieves a timing of 1.13s. However, I assume 'sort' does most of its work after all the input was read. Let's parallellize the sort and do a faster merge (-m) of the previously sorted results afterwards. This is quite easy to set-up using bash's process substitution "<()":
$ sort -m <(sreqs access_log.1 access_log.2) <(sreqs access_log.3 access_log.4) | rank
Now the time is down to 0.64s. The log files are still processed in pairs though. Let's launch an "sreqs" subprocess for each logfile:
$ sort -m <(sreqs access_log.1) <(sreqs access_log.2) <(sreqs access_log.3) <(sreqs access_log.4) | rank
This does not improve the time anymore (it raises a little to 0.67s).
[link]