This is the second progress report from the Wide Finder Project, and a follow-on from the first, Erlang Ho! The one thing that Erlang does right is so important that I’d like to get past its problems. So far I can’t.

[Update: Several people wrote “show me the data!” Well, OK then. Ten thousand lines of logfile, a little over two meg, may be found at www.tbray.org/tmp/o10k.ap. Anyone who can generate Erlang code that can read this stuff and parse out the ongoing fragment fetches at a remotely competitive speed will get me interested in Erlang again.] [Update II: See the comments: the problem seems to be io:get_line (which is a serious problem, by the way). At first glance, the solutions work for the case when you can read the whole file into memory before you start to process any of it. Looks like I’ll be spending a bit more quality time with E.]

Interactive Irritation · If you type erl at your shell prompt, you’re in Erlang. There are differences between what you can have in a file and what you can type at the prompt. And when you type control-D, nothing happens; to get out, you have to interrupt it with control-C or equivalent. This is wrong.

If you type erlc foo.erl it’ll compile into foo.beam, but you can’t just type erl foo.beam to run it, you have to add a bunch of stupid pettifogging options. This is wrong.

Slow Basics · Yesterday, I reported on Erlang’s appalling regexp performance. Someone using the handle “Masklinn” suggested using pattern-matching with what Erlang calls “binaries” instead. So I did. Let’s leave the add-’em-up part of my problem out, and zero in on the problem of getting Erlang to read the 1,167,948 lines of logfile and select the 105,099 that are fetches of ongoing fragments.

I coded it up using patterns per Masklinn’s suggestion and it was really much better, burning only 56.44 CPU seconds on my MacBook. The code looks like this:

process_match(<< "/ongoing/When/", Trailer/binary >>) ->
    Last = binary_to_list(Trailer),
    case lists:member($., Last) of
	true -> 0;
	false -> 1
    end;
process_match(_) -> 0.

scan_line(eof, _, Count) -> Count;
scan_line(Line, File, Count) ->
    Uri = list_to_binary(lists:nth(7, string:tokens(Line, " "))),
    NewCount = Count + process_match(Uri),
    scan_line(io:get_line(File, ''), File, NewCount).

At this point I smelled a rat; in particular, an I/O rat. So I ripped out all the fragment-recognition crap and measured how long it takes Erlang to read the lines:

scan_line(eof, _, Count) -> Count;
scan_line(_, File, Count) ->
    scan_line(io:get_line(File, ''), File, Count + 1).

That was better: a mere 34.165 CPU seconds.

Except for, on the same MacBook, my simple little Ruby program read the lines, parsed out the fragment fetches, and did all the totaling and sorting in, let’s see... 3.036 CPU seconds (3.47 seconds elapsed).

Hold On a Second · Unlike Ruby, Erlang is highly concurrent. So if I ran it on an 8-core machine, the Ruby would run at the same speed, but the Erlang ought to go faster. Except for parallelizing line-at-a-time I/O from a file would be hard. And even if you could, and go eight times as fast, and even leaving out all the matching and adding, you’re still half again as slow as Ruby. This is wrong.

Dear Erlang I: · I like you. Really, I do. But until you can read lines of text out of a file and do basic pattern-matching against them acceptably fast (which most people would say is faster than Ruby), you’re stuck in a niche; you’re a thought experiment and a consciousness-raiser and an engineering showpiece, but you’re not a general-purpose tool. Sorry.

I’m done with beating my head against Erlang for now. If someone can show me how to make it read and pattern-match at a remotely competitive speed, I’ll be able actually to look at that nice concurrency stuff; and I’d like to.

Dear Erlang II: · Thank you for helping me think about the Wide Finder problem. As a result of these days together, I think I know what the ideal solution would look like. I suspect it’s not out there yet, but I bet I can recognize it when I see it. More on that later.



Contributions

Comment feed for ongoing:Comments feed

From: Evan (Sep 22 2007, at 21:53)

For what it's worth, and I'll admit that it's been a while since I checked it out, Termite Scheme is a Scheme implementation which is meant have Erlangesque concurrency features. It's built of Gambit Scheme, which is meant to be one of the faster Schemes. It lives here: http://toute.ca/. It was done as a thesis project, so it might not be current or well supported.

Keeping with the insect theme, there was something called Mosquito Lisp, done by some people called (appropriately enough) Ephemeral Security. It appears to be gone, though, now that I've gone looking for the link. It's http://ephsec.squarespace.com/mosquito-lisp/ for what it's worth. Perhaps it'll be back up soon.

I'm following this one closely. Something that goes seamlessly from simple scripts to writing 'enterprise class' concurrent systems more or less seamlessly is something that I've been waiting for a good long while now.

[link]

From: Steve Vinoski (Sep 23 2007, at 00:17)

Hi Tim, to get out of erl, just type "q()." minus he quotes (and don't forget that final period/full stop).

[link]

From: Alan Little (Sep 23 2007, at 00:19)

I think (admittedly without personal experience of trying to use it for anything serious) Erlang is rather more than a "showpiece". All those Ericsson heavy duty backbone routers and switches that have been up for years aren't just for show.

Perhaps it would be fairer to say that it's a specialised piece of heavy duty industrial machinery and not a general purpose tool? It seems to me you are to some degree complaining that a nuclear reactor isn't a very good tool for making a cup of tea.

[link]

From: Steve Vinoski (Sep 23 2007, at 00:44)

Also, Tim, can you do what Thomas Lackner suggested in a comment to your previous post and make a file available with the data you're trying to process? It's hard to solve a data processing problem without the data. ;-)

[link]

From: Martin Probst (Sep 23 2007, at 01:22)

Did you check out Scala Actors?

Scala is a very compelling Java replacement, and it has an implementation of Erlang-style concurrency in a library called Scala Actors. Might be worth checking out.

[link]

From: Masklinn (Sep 23 2007, at 02:21)

> I think (admittedly without personal experience of trying to use it for anything serious) Erlang is rather more than a "showpiece". All those Ericsson heavy duty backbone routers and switches that have been up for years aren't just for show.

Erlang clearly isn't a showpiece, but Erlang was never designed with string manipulation in mind, and it shows.

Erlang will make it easy to serve almost anything, or create a local botnet to try to DDOS a box for testing purposes, but saying that heavy string lifting isn't its forte is quite an understatement.

[link]

From: Pete Kirkham (Sep 23 2007, at 03:13)

Making up a quick and dirty KMP match in C on a amd64 laptop, running it with while so the cpu actually gets to full speed, and comparing with one which just reads the data and doesn't match:

$ while true; do time bin/string_match_bench datasets/original/hundred-o10k.ap > /dev/null; done

real 0m1.252s

user 0m1.084s

sys 0m0.156s

$while true; do time bin/read_bench datasets/original/hundred-o10k.ap > /dev/null; done

real 0m0.139s

user 0m0.000s

sys 0m0.132s

Timings are for a the sample data concatenated a hundred times, otherwise they values for read_bench are lost in the noise.

So there may be room to farm the matching off to different CPUs, or write a better matching algorithm (this test doesn't separate splitting the IO block into lines and scanning them; the former is less obviously parallel).

[link]

From: Fernando J. Pereda (Sep 23 2007, at 04:17)

Well... this code is able to process your testcase in less than 3 seconds in my dog slow AthlonXP 2000+:

[I don't know if the code will come 'as-is' so I'm posting it to http://dev.gentoo.org/~ferdy/tmp/reader.erl]

-module(reader).

-export([results_for/1, read_lines/1]).

results_for(File) ->

N = read_lines(File),

io:format("Found ~w matches in ~s~n", [N, File]).

read_lines(File) ->

{ok, Rest} = file:read_file(File),

List = binary_to_list(Rest),

Lines = string:tokens(List, "\n"),

parse_lines(Lines, 0).

parse_lines([], N) ->

N;

parse_lines([H|T], N) ->

Uri = list_to_binary(lists:nth(7, string:tokens(H, " "))),

NewN = N + process_match(Uri),

parse_lines(T, NewN).

process_match(<<"/ongoing/When/", Trailer/binary>>) ->

Last = binary_to_list(Trailer),

case lists:member($., Last) of

true -> 0;

false -> 1

end;

process_match(_) -> 0.

Instead of parse-lines, use a map function or a parallel map and things should be better and way less ugly.

- ferdy

[link]

From: igwan (Sep 23 2007, at 04:22)

You're right, io:get_line is DAMN slow ! Because it reads the file character by character ... This function is meant to read input from a keyboard, not to split lines on a file.

When I want to parse files in erlang, I read it in full first :

{ok, Binary} = file:read_file(Filename).

Then work on the Binary. Splitting your file into lines using pattern matching then takes 0.891 s on my 2,00 Ghz CPU.

Granted, Erlang lacks a good regexp library that works on binaries. It heard someone on the erlang mailing-list is currently working on it.

Regarding concurrency, you say : "So if I ran it on an 8-core machine, the Ruby would run at the same speed, but the Erlang ought to go faster.". Concurrency in Erlang is explicit, not automatic. In your case, I/O could not be parallelized but the parsing could. I would just spawn a bunch of process, and send each one a number of lines to parse, then collect the results.

[link]

From: Toby DiPasquale (Sep 23 2007, at 04:41)

Perhaps you should try Scala <http://www.scala-lang.org/>? It has an Actor library to provide similar concurrency features as Erlang, but it compiles down to JVM bytecode and is purported to run at near-Java speeds. As well, it can call into any Java library if you need that sort of thing.

[link]

From: Jeff Hodges (Sep 23 2007, at 04:58)

I was unable to reproduce your results on my own machine. At least, not the simple count. It returns within a second or 2. Was the log being written to while you were testing? Am I correct in assuming that you set `File` with `{ok, File} = file:open('/path/to/o10k.ap', read)`?

Now, I was unable to run the more complex form of scan_line without an error in `string:tokens1`. What arguments are you passing to `scan_line` initially?

I've written some code that I believe is a bit more idomatic than the code you posted here. That runs very quickly on my own MacBook Pro. Could you try both the `count_lines` and `count_tbray_matches` functions and reproduce the results from it? http://somethingsimilar.com/code/fold_lines.erl

We should not that I shamelessly stole the basis of this code from TrapExit: http://www.trapexit.org/Count_Lines_in_a_File

The fold_files version of `countlines` runs at under a second on my MacBook Pro, and `count_tbray_matches` runs in just over a second or 2. 756 is the number of matches, correct?

[link]

From: Ceri Storey (Sep 23 2007, at 05:21)

For what it's worth, you might want to consider anonymising the data in that logfile. I guess you don't have an explicit privacy policy, but all that data complete with internet protocol address ("Oh noes, you have my IP address") might upset a few people.

Of course, the fact that they're voluntarily visiting a site where someone else controls the logging is another point entirely.

[link]

From: Ulf Wiger (Sep 23 2007, at 06:08)

It is true that this form of I/O (esp combined with text processing) is a weak point of Erlang's. This is plainly evident in the shootout (http://shootout.alioth.debian.org/).

One may keep in mind that the type of I/O that Erlang applications normally encounters is in the form of thousands of concurrent streams. It is quite common for Erlangers to have to lift the default limit on file descriptors, and using about as many as they are allowed. Furthermore, the amount of data passed on each stream (signaling data) is normally quite small, but response times are of great importance.

While it would be great if it were also very good at reading one stream really quickly, this hasn't been an especially interesting problem in the type of application where Erlang is traditionally used. I appreciate that this doesn't satisfy those who care particularly applying regexps to large amounts of file data.

FWIW, it's not an intrinsic problem with Erlang, but rather a result of implementation choices in the Erlang virtual machine.

BR,

Ulf Wiger

[link]

From: Anders N (Sep 23 2007, at 07:36)

I am not sure how this will come out, since there is no preview.

But the following takes ~0.6 s on my core2 duo laptop.

lines() ->

{ok,B} = file:read_file("o10k.ap"),

S=binary_to_list(B),

Lines = string:tokens(S,"\n"),

lists:foldl(fun (Line, Acc) ->

Acc + process_match(lists:nth(7, string:tokens(Line, " ")))

end,

0, Lines).

process_match("/ongoing/When/" ++ Last) ->

case lists:member($., Last) of

true -> 0;

false -> 1

end;

process_match(_) -> 0.

[link]

From: Thomas Lindgren (Sep 23 2007, at 08:40)

Reading the whole file into a binary takes 7.3ms on my machine (1.7 GHz, 0.5 GB), so 34 seconds of I/O seems appalling.

1> T0 = erlang:now(), file:read_file("o10k.ap.html"), T1 = erlang:now(), timer:now_diff(T1,T0).

7316

And just to check:

2> {ok, B} = file:read_file("o10k.ap.html"). {ok,<<"<html><head></head><body>host-24-225-218-245.patmedia.net - -\n[01/Oct/2006:06:33:45 -0700] \"GET /ongoing/ong"...>>}

3> size(B).

2016367

(As you can see, I just saved the page from the browser. I guess it was raw text originally.)

[link]

From: Alan Griffith (Sep 23 2007, at 08:52)

What about Parallel Python. Of course it's not Erlang, and I'm not arguing Python vs. Ruby, but that would be one place I'd look for to answer this kind of problem.

[link]

From: Matt Brubeck (Sep 23 2007, at 08:58)

"Except for parallelizing line-at-a-time I/O from a file would be hard."

No, it's easy. The first step is to split the file into N files, one for each thread or process.

[link]

From: Caoyuan (Sep 23 2007, at 10:46)

The data file www.tbray.org/tmp/o10k.ap only contains 10,000 lines, so the result around 0.4s should be multipled by 100 to gain the result for 1,167,948 lines data file that Tim's case.

I tested various solution on 100 times copied of www.tbray.org/tmp/o10k.ap (It's a 240m size file), and got the result around 40s same as Tim.

test/1, test1/1 and test2/1 cost almost the same time. That is, parsing/matching the binary or list is not time consumer in Erlang. Actually the bottle neck is reading binary or stream byte by byte, please see test2/1.

<pre>

-module(widefinder).

-export([test/1,

test1/1,

test2/1]).

test(FileName) ->

statistics(wall_clock),

{ok, IO} = file:open(FileName, read),

{Matched, Total} = scan_line(IO),

{_, Duration} = statistics(wall_clock),

io:format("Duration ~pms~n Matched:~B, Total:~B", [Duration, Matched, Total]).

scan_line(IO) -> scan_line("", IO, 0, -1).

scan_line(eof, _, Matched, Total) -> {Matched, Total};

scan_line(Line, IO, Matched, Total) ->

NewCount = Matched + process_match(Line),

scan_line(io:get_line(IO, ''), IO, NewCount, Total + 1).

process_match([]) -> 0;

process_match("/ongoing/When/"++Rest) ->

case parse_until_space(Rest, false) of

true -> 0;

false -> 1

end;

process_match([_H|Rest]) -> process_match(Rest).

test1(FileName) ->

statistics(wall_clock),

{ok, Bin} = file:read_file(FileName),

{Matched, Total} = scan_line1(Bin),

{_, Duration} = statistics(wall_clock),

io:format("Duration ~pms~n Matched:~B, Total:~B", [Duration, Matched, Total]).

scan_line1(Bin) -> scan_line1(Bin, [], 0, 0).

scan_line1(<<>>, _Line, Matched, Total) -> {Matched, Total};

scan_line1(<<$\n, Rest/binary>>, Line, Matched, Total) ->

%Line1 = lists:reverse(Line),

scan_line1(Rest, [], Matched, Total + 1);

scan_line1(<<C:1/binary, Rest/binary>>, Line, Matched, Total) ->

%NewCount = Matched + process_match(Line),

scan_line1(Rest, [C|Line], Matched, Total).

test2(FileName) ->

statistics(wall_clock),

{ok, Bin} = file:read_file(FileName),

Total = travel(Bin),

{_, Duration} = statistics(wall_clock),

io:format("Duration ~pms~n Total:~B", [Duration, Total]).

travel(Bin) -> travel(Bin, 0).

travel(<<>>, ByteCount) -> ByteCount;

travel(<<_C:1/binary, Rest/binary>>, ByteCount) ->

travel(Rest, ByteCount + 1).

parse_until_space([$\040|_Rest], Bool) -> Bool;

parse_until_space([$.|_Rest], _Bool) -> true;

parse_until_space([_H|Rest], Bool) -> parse_until_space(Rest, Bool).

</pre>

[link]

From: Masklinn (Sep 23 2007, at 10:54)

To those who can't reproduce the result, the provided file is only 1% of the original logfile (10k lines versus 1,167,948), so your result have to be multiplied by 100 to have a good estimate of a "final" time.

Also, if you want to see what to compare against, Tim made his Ruby script available previously:

counts = {}

counts.default = 0

ARGF.each_line do |line|

if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

counts[$1] += 1

end

end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }

keys_by_count[0 .. 9].each do |key|

puts "#{counts[key]}: #{key}"

end

Just dump it in a .rb file, then `ruby yourscript.rb logfilename`, that way you can time the baseline (for your machine) and compare the outputs (note that it only prints the 10 most requested urls).

Finally, tim, since R11B4 there is a tool called "escript". Allows you to start simple erlang scripts from the CLI, but you can also use it to bootstrap your "main" script (e.g. `escript parser` will call parser.start() or something like that)

[link]

From: Wil Carlton (Sep 23 2007, at 10:59)

The Tim Bray gravy train, which should last another ten or twenty years:

On the discovery of a new programming lanugage X, perhaps one that's becoming trendy among the Pragmatic hipsters:

- try out tired character-grabbing idioms that worked in C|perl|php|ruby, blah. . .

- Speak at (X).Conf about why X sucks at grabbing characters in an I18N world

- While you're in there, might as well haul in some XML baggage to test out. and be sure to let us know how miserably that failed, too.

End /* On */

It's reasonable that not all programming languages are intended for "finding things" in text files. Why condemn future language designers to solving a problem that's already been excruciating

hashed out already?

If I were doing an Erlang diary, I'd investigate its strengths and why it's different from <insert best-selling O'Reilly programming language book title here>

Good luck.

[link]

From: Toby (Sep 23 2007, at 11:16)

Matt, isn't it even easier - assuming the 'processing' part is non-trivial (and it must be if you expect to gain from multicore) - just have a read process which dispatches data to a bunch of other processes.

[link]

From: Seth W. Klein (Sep 23 2007, at 11:22)

It's net friendly to remember to zip 2MB text files when you put them online :)

[link]

From: Bob Ippolito (Sep 23 2007, at 14:04)

I believe the fast path for Erlang is actually to use a port to do this, since the majority of Erlang's communication happens at that layer and you can get that to do line parsing and stuff for you.

Here's a version that uses a port and cat to do it, which runs at least twice as fast on my MBP than the "reader.erl" example from a previous comment, but that advantage probably goes up with a larger input since it includes the process spawn overhead of open_port. It also doesn't read the whole file into memory, so you could stream gigs of logs

through it, even on a 32-bit erlang :)

The zip includes a copy of the example log file plus a utility module it uses to do shell quoting of file names for open_port.

undefined.org/erlang/o10k.zip

<code>

(emacs@ahi.local)50> timer:tc(o10k, results_for, ["o10k.ap"]).

{270716,1101}

(emacs@ahi.local)51> timer:tc(reader, results_for, ["o10k.ap"]).

Found 1101 matches in o10k.ap

{600700,ok}

</code>

[link]

From: Victor (Sep 23 2007, at 14:36)

I think I should point out that the Erlang concurrency does occur automatically. If you are not splitting your work into separate processes there's nothing to run concurrently.

The code you listed above can't take advantage of your multiple cores because you wrote it that way.

[link]

From: Clayton Wheeler (Sep 23 2007, at 14:56)

I've been poking around with some alternative I/O strategies, specifically using file:open(Path, [read, raw, read_ahead}) with file:read/2 in 1K chunks, and also opening an I/O port to "cat <filename>" in line-at-a-time mode.

Both of these approaches take 140-150 ms of user time for just reading the lines of your file, using HiPE on an old slow x86 machine. (The port strategy is almost twice as fast as the open/read one on my MacBook, oddly; I'd guess that's HiPE optimizing the line-splitting code.) Doing the same thing, i.e. nothing, with Ruby's ARGF.each_line takes 141 ms, so I'd say that's competitive at the I/O level.

[link]

From: Steve Vinoski (Sep 23 2007, at 21:57)

Hi Tim, I posted an approach that makes clear use of multiple cores, which I think is what you were hoping for, over on my (new) blog:

http://steve.vinoski.net/blog/2007/09/23/tim-bray-and-erlang/

[link]

From: Akhilesh (Sep 23 2007, at 23:57)

Tim, Erlang is primarily used for its concurrency and high availability features. Saying that it should first get some XYZ thing right before you'd bother with its primary features, is not the right approach... that XYZ thing (IO and pattern matching) has _never_ been erlang's focus. Perl/Ruby is *designed* for that purpose, and I expect it to blow erlang out of water on that... and of course, for concurrency and high availability, erlang returns the favor to Perl/Ruby.

[link]

From: etnt (Sep 24 2007, at 00:14)

[quote]

....you’re stuck in a niche; you’re a thought experiment and a consciousness-raiser and an engineering showpiece, but you’re not a general-purpose tool....

[/quote]

Ha,ha,ha...that's funny!

Hint: Use Perl or something similar to solve your problem and let Erlang do what it does best, i.e complex 24x7 systems.

[link]

From: tsuraan (Sep 24 2007, at 16:37)

"you can’t just type erl foo.beam to run it, you have to add a bunch of stupid pettifogging options. This is wrong"

Erlang's not perl. Lots of languages aren't perl, and don't even try to be "a better perl." That's ok. However, if you want to pretend erlang is a scripting language, use escript. A simple example, taken straight from the escript man page:

tsuraan@macbeth ~ $ cat simple.erl

#!/usr/bin/env escript

main([String]) ->

try

N = list_to_integer(String),

F = fac(N),

io:format("factorial ~w = ~w\n", [N,F])

catch

_:_ ->

usage()

end;

main(_) ->

usage().

usage() ->

io:format("usage: factorial integer\n"),

halt(1).

fac(0) -> 1;

fac(N) -> N * fac(N-1).

tsuraan@macbeth ~ $ ./simple.erl 10

factorial 10 = 3628800

So, if you do feel like continuing to treat erlang as a scripting langauge (and I really don't think you'll ever be as happy with it for that purpose as you are with Ruby or perl), escript might help a bit.

[link]

From: Stephan Schmidt (Sep 26 2007, at 09:12)

When parsing large log files one should use memory mapped files. I did this when parsing very large FTP log files for analysis. Back then I used Java NIO memory mapped files with some success.

Those files are swapped into memory by the operating system memory managment. Can't get faster than that.

Peace

-stephan

--

Stephan Schmidt :: stephan@reposita.org

Reposita Open Source - Monitor your software development

http://www.reposita.org

Blog at http://stephan.reposita.org - No signal. No noise.

[link]

author · Dad
colophon · rights
picture of the day
September 22, 2007
· Technology (90 fragments)
· · Concurrency (76 more)
· · Erlang (4 more)

By .

The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.

I’m on Mastodon!