There are ripples spreading across the concurrency pond, some provoked by this series, and there’s this one issue that keeps coming up: “Concurrency” vs. “Parallelism”. It is asserted that they’re not the same thing, and that this is important.

[This is part of the Concur.next series.]

From the wilds of Haskell-land comes Parallelism /= Concurrency by (I believe) Simon Marlow. Ted Leung’s piece is similarly titled: Concurrency => Parallelism; it in turn references Peter Van Roy’s Programming Paradigms for Dummies: What Every Programmer Should Know (PDF).

I found Van Roy’s paper worth reading, once you get past all the verbiage about how wonderful what he’s about to tell you is; his discussion of P vs. C was the clearest I’ve encountered.

I’m with Ted when he says “From where I sit, this is all about exploiting multicore hardware.” I’d modify that by adding “... in a way that has a low barrier to entry for typical application programmers.” And on that basis, I’m really having a hard time following the C-vs.-P narrative where it seems to lead.

Marlow’s argument is that if you are so fortunate as to be living in Haskell’s benevolent side-effect-free glow, “you are free to run different parts of the program at the same time without observing any difference in the result.” I.e. it’s deterministic, which is a Good Thing. Models based on processes and actors, on the other hand, accept that they execute in an unpredictable order, maybe parallel or maybe not, and are thus non-deterministic, which Must Be Bad.

Which, if true, means that Erlang and Clojure, for example, couldn’t possibly work. Well, except for they do. Oh, well, Marlow allows, “STM is better than locks, and message passing has its advantages”.

(And tuple spaces must be even more pathetic-and-hopeless, because they’re even less deterministic; you don’t even know where the messages you’re sending are going to end up.)

I think the point that Marlow and the FP tribe are trying to make is important: a central virtue of their approach is that the speedup due to concurrency/parallelism can, to some degree, be achieved automatically by the system, while the process/actor approach requires the developer to think about what happens in parallel with what, and who has to tell whom what to do. (Having said that, Marlow goes on to discuss “par/pseq” and Nested Data Parallelism, both of which seem to put some of the parallelism under developer control.)
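
To make that developer-controlled flavour concrete, here is a minimal par/pseq sketch; it’s my own illustration, not Marlow’s code, and it assumes GHC with the parallel package, compiled with -threaded and run with +RTS -N. The only annotations are par and pseq; the runtime decides whether the spark is actually worth running on another core, and the answer comes out the same either way.

    import Control.Parallel (par, pseq)

    -- Spark the evaluation of one summand while forcing the other, then combine.
    -- The result is identical whether or not the spark runs in parallel, which
    -- is what keeps this style deterministic.
    parSum :: [Int] -> [Int] -> Int
    parSum xs ys = a `par` (b `pseq` (a + b))
      where
        a = sum xs
        b = sum ys

    main :: IO ()
    main = print (parSum [1 .. 1000000] [2 .. 2000000])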

So the jury is out; I’d be unsurprised if FP turns out to be an important building block for the Java of Concurrency. But I’d be quite surprised if there turned out to be no role for the process/actor approach. At least some in the Haskell community seem to agree.

This sort of wholesale dissing of a family of approaches that seems empirically to work in some important application scenarios really isn’t helpful.



Contributions


From: Martin Probst (Oct 08 2009, at 01:39)

What you have not mentioned yet (iirc) is the approach taken by Digital Mars / D:

http://en.wikipedia.org/wiki/D_(programming_language)

They have elaborated on "const" as in C(++) and came up with a notion of transitive immutability which also applies to functions.

http://www.digitalmars.com/d/2.0/const3.html

So they have basically extended a regular OO programming language with functional (side-effect-free) parts that can be proven by the compiler. I'm not sure if they already use this for concurrency, but of course it also helps with checking program correctness.

I think this might indeed be a way to get the Java of Concurrency - take the Java of today, and add functional programming to it. Then you can have all the nice "I can parallelize this function by partitioning the array ..." effects and still have your programmers in a (mostly) comfortable environment.


From: DGentry (Oct 08 2009, at 04:53)

> I’d be unsurprised if FP turns out to be an important building block for the Java of Concurrency. But I’d be quite surprised if there turned out to be no role for the process/actor approach

This would imply that the runtime - presumably, a virtual machine - will always have a pool of threads available for use either explicitly by the programmer, or transparently to farm out work which is inherently safe.

This could lead to the situation where a programmer explicitly working to parallelize part of their application results in no net speedup, as the runtime was already able to find enough work to run on all cores. We'd need the profiling tools to be aware of the behavior of all threads, to point to areas of the program which are under-using the available CPUs.


From: Don Stewart (Oct 08 2009, at 10:06)

Simon's not arguing that approaches to parallelism that allow for non-determinism are "never going to work". He's been working on Concurrent Haskell for 20 years, after all.

I think what he is arguing, though, is that there is a *useful distinction* between abstractions that allow for fully deterministic parallelism (such as par/pseq strategies, and NDP) and those where things can still go wrong (explicit threads and locks/STM).
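
To make that distinction concrete, here is a small sketch of my own (assuming GHC's Control.Concurrent and the parallel package; the numbers are arbitrary). The first function gives the same answer under any scheduling; the second depends on how two threads interleave.

    import Control.Concurrent (forkIO, threadDelay)
    import Control.Concurrent.MVar (newMVar, modifyMVar_, readMVar)
    import Control.Parallel (par, pseq)

    -- Deterministic parallelism: the result never depends on the scheduler.
    deterministic :: Int
    deterministic = a `par` (b `pseq` (a + b))
      where
        a = sum [1 .. 100000 :: Int]
        b = sum [100001 .. 200000 :: Int]

    -- Explicit concurrency: two threads mutate shared state, and the order
    -- in which (+ 1) and (* 2) happen is up to the scheduler.
    nondeterministic :: IO Int
    nondeterministic = do
      box <- newMVar (0 :: Int)
      _ <- forkIO (modifyMVar_ box (pure . (+ 1)))
      _ <- forkIO (modifyMVar_ box (pure . (* 2)))
      threadDelay 10000   -- crude wait for both threads; good enough for a demo
      readMVar box

    main :: IO ()
    main = do
      print deterministic        -- always 20000100000
      nondeterministic >>= print -- 1 or 2, depending on interleaving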


From: Fred Blasdel (Oct 08 2009, at 10:07)

A problem you seem to be willfully ignoring is that nobody but desktop developers should be fretting about *exploiting multicore hardware*.

If you're operating a service, you need to be able to *exploit multiple machines*! Focusing on multicore hype is optimizing in entirely the wrong direction, especially since you can just run multiple independent processes on one machine to fully utilize your CPUs.


From: Manuel Chakravarty (Oct 08 2009, at 14:37)

You write about Marlow's argument, "Models based on processes and actors [..] Must Be Bad." Marlow doesn't say that these models are bad. He writes that they focus on concurrency and not on parallelism, which is great if your aim is to write an explicitly concurrent program, but quickly becomes tedious if your aim is to speed up an otherwise perfectly sequential application for the sole purpose of increasing its performance on parallel hardware.

You continue to write about Marlow's argument, "Which, if true, means that Erlang and Clojure, for example, couldn’t possibly work. Well, except for they do." Yes, Erlang does work great, but for concurrent programs! The first Erlang book is entitled "Concurrent Programming in Erlang" (Armstrong et al). If you read this book, you will find that it is not about improving performance on multicore hardware; instead it is about implementing applications (such as telecommunications software) that are multi-threaded because they come from an inherently concurrent (and nondeterministic) problem domain.

Besides, if you ask Joe Armstrong (of Erlang fame) about the wisdom of combining side-effects and concurrency, he will be the first to tell you that the two do not mix well at all, and that this insight is at the core of the design of Erlang. This just supports much of what Marlow writes.


From: gvb (Oct 08 2009, at 18:37)

Very interesting series, Tim, thanks!

Background: I have 20++ years experience in writing embedded programs based on real-time operating systems (including writing roll-yer-own RTOSes). I am very familiar with the concurrency problems of coordinating multiple tasks, locking, unlocking, and race conditions. :-O

IMHO, the intractable problem is dependencies, and parallelism does not (cannot) solve the dependency problem. Side-effect free languages do *not* solve input dependencies: a calculation cannot be completed until *all* input values are known. Thus the fastest a program can run is the time to calculate all input values plus the delta for the final output-value calculation (note that logic, i.e. branching, is simply Boolean calculation). The most a given program can be parallelized is the number of independent values that can be calculated at any given time.

Simple dependency illustration:

A = B + C

D = A + F + G

The value A and the partial sum F + G can be calculated in parallel but the value D *cannot* be determined until the value of A is known.
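
In Haskell terms, using the par/pseq combinators discussed up-thread (a sketch only; the function and argument names are made up), the independent work can be sparked but D still has to wait for A:

    import Control.Parallel (par, pseq)

    -- A and the partial sum (F + G) are independent, so they can be evaluated
    -- in parallel; D cannot be produced until A is known, no matter how many
    -- cores are available.
    d :: Int -> Int -> Int -> Int -> Int
    d b c f g = a `par` (fg `pseq` (a + fg))
      where
        a  = b + c    -- A = B + C
        fg = f + g    -- the partial sum F + G

    main :: IO ()
    main = print (d 1 2 3 4)   -- 10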

A common algorithm with severe data dependencies is calculating the Fibonacci sequence: the two previous Fibonacci numbers are needed to calculate the next one. Using a side-effect free language on a massively parallel computer will not generate Fibonacci numbers any faster than an equivalent side-effect based language running on just one processor in that computer.

I contend that, in the Real World[tm], the most common problems are ones that are dominated by highly dependent data (often combinatorial logic). These problems are rife with data dependencies and thus have hard limits to how much can be calculated in parallel. Relatively speaking, only a minority of Real World problems are characterized by highly independent data - array/vector processing, image processing (including video displays), etc.

Even the "wide finder" project had inherent parallelizing limitations:

1. The input data has to be split up before any parallel processing can be started.

2. Within the split data set, the problem cannot be split any smaller than one line.

3. The summing of the counts cannot be completed until all the lines are processed.
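
As a sketch of those three stages (not the actual Wide Finder code; the chunk size and the "hit" counting below are stand-ins, and it assumes the parallel and containers packages), only the middle stage parallelizes:

    import Control.Parallel.Strategies (parMap, rdeepseq)
    import qualified Data.Map.Strict as M

    -- Stage 1: split the input into chunks of lines (inherently sequential).
    chunksOf :: Int -> [a] -> [[a]]
    chunksOf _ [] = []
    chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

    -- Stage 2: count per chunk; each chunk is independent, so chunks can be
    -- processed in parallel (here every non-empty line is its own key,
    -- standing in for the real regexp match).
    countChunk :: [String] -> M.Map String Int
    countChunk = M.fromListWith (+) . map (\ln -> (ln, 1)) . filter (not . null)

    -- Stage 3: merging the partial counts cannot finish until every chunk is done.
    wideFinder :: [String] -> M.Map String Int
    wideFinder = M.unionsWith (+) . parMap rdeepseq countChunk . chunksOf 10000

    main :: IO ()
    main = interact (show . wideFinder . lines)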

--------

Looking at the progression of parallel execution within a CPU is instructive.

First there was simple in-order instruction execution. This was easy to understand, but (software) people clamored for more speed.

To speed things up, CPU designers added out-of-order instruction execution, but synchronized things so that the instructions were retired in-order, giving the illusion of in-order execution. This was great, but the CPUs still were blocked due to data dependencies.

To minimize data dependency-related stalling, CPU designers added register renaming (so register "R1" could be used "simultaneously" by multiple instructions, as long as the data held in "R1" was independent). They also added speculative execution (branching is a form of data dependency: the branch is dependent on the flag controlling the branch). Speculative execution runs *both* branches until the conditional is resolved, then throws away the "wrong" branch's results. Speculative execution speeds up CPUs, but it is very brute force and very limited.

Here is a question for CPU architects: how much overhead logic is required to accommodate register renaming and speculative execution? My understanding is that it is a significant fraction of the CPU logic (i.e. not counting on-die cache). Furthermore, while register renaming and speculative execution help, sometimes significantly, there is only so much they can do to speed up a CPU. One of the reasons for our multicore dilemma is *that* CPU manufacturers ran out of gas with the brute force approach of register renaming and speculative execution and did a philosophical pivot to multicore. IOW, they just pushed the concurrency/parallelism problem back onto us software weenies.

Trivia: Google reportedly does speculative execution in their search engine: they farm out a search to a pile of processors and use the first answers that come back, discarding later answers. They even use this to their benefit since it makes the search resilient to processor failure.

References:

<http://en.wikipedia.org/wiki/Register_renaming>

<http://en.wikipedia.org/wiki/Fibonacci_number>


From: gvb (Oct 08 2009, at 19:39)

This was just highlighted on Hacker News; it states what I was trying to say, only more eloquently:

http://www.denbeste.nu/essays/futurecs.shtml

"[...] The difficulty is that we don't know how to apply the power of a parallel computer to solve serial problems. That is what we must learn.

"There are many kinds of problems which are fundamentally parallel to begin with, which can easily absorb enormous amounts of parallel computation capability with ease. The best example of that is 3D rendering, where each individual pixel can be computed independently of all the others, and where that computation itself can be pipelined. You could use ten or more processors per pixel, plus thousands of others doing overhead and bookkeeping and optimization. If you're rendering a 1600*1200 animation in real time, you could easily apply as many as twenty million processors with almost linear scaling.

"On the other hand, some problems are fundamentally serial: they involve calculating a lot of values but each calculation requires knowing the answer to the previous calculation.

[...]

"The problem is that we only have the most primitive ideas of how to program [a massively parallel computer], except when taking on problems which are fundamentally parallel anyway. This is the great problem for computer science in the first half of the twenty-first century."


From: Robert Young (Oct 10 2009, at 18:35)

"This is the great problem for computer science in the first half of the twenty-first century."

The point is, as I've mentioned on another thread, this problem was faced (and not solved) in the 1980's. It's been 30 years, and the math (I submit that any answer must be found in fundamental maths, not ad hoc coding; i.e. a replacement for the von Neumann architecture) hasn't been found. If I were smart enough, I would do it, get rich and move to Bermuda.

Wheel spinning in language specifications is wasting time. The problem isn't in languages and won't be solved with the Next Great Concurrent Java because there cannot be one.


From: Jeremy Dunck (Oct 12 2009, at 07:55)

Robert,

If true, what would you recommend as a course? Should we all go back to being assembly programmers so we can eke out the best possible performance?


From: Robert Young (Oct 12 2009, at 19:29)

Jeremy:

Not near my point, which is that specifying some new language, or adding more cruft to any existing language, won't solve the problem. It's a waste of time. It's not a problem that can be solved at the application language level, including assembler.

What the parallel pioneers in the 80's found, to their dismay, is that 1) linear problems can't be sped up with parallel machines, and 2) that parallel problems can, but they ain't many of those. Writing in assembler (or any known language) is no solution.

The reality is that multi-threaded/core/processor machines are merely a bunch of von Neumann units stitched together; using such a machine doesn't change the processing semantics, it merely piles many on top of many. To utilize parallel machines we either need parallel problems, e.g. servers of all types (multi-user and multi-programming problems; I'll toss in classic array processing here, too), or compilers which transparently disassemble linear programs (not linear programming programs; these are natively suited to such machines) into parallel ones. The TM folks had bespoke compilers; so far as I know none was smart enough to do what I've just described.

Or, ultimately, a new machine semantic, different from von Neumann, which does the transformation in hardware. We have absurd transistor counts to work with, but not a new way to use them. Last I knew, a significant proportion of that count goes to caches; that's not progress. That's admission of futility.

This is why I say the issue is with the maths: some new set of theorems which drive the decomposition of linear problems into parallel ones. Whoever figures that out will be rich, and likely resemble Beldar.

Current situation:

2 + 2 = 4 (linear solution)

2 + 2 + 1 = 4 (ignore the 1, it's just overhead to the parallel solution)

Until such a maths is found, I remain convinced that the path will continue to be Back to The Future: the early 90's where we had dumb terminals connected to smart databases. The terminals just displayed a memory map of the client process on the server. With netbooks (simplistic processors, albeit pixelated display) and high bandwidth, that is where we are headed. The gorilla machines are the database and web servers.

This presents a problem for the likes of Intel/M$, since the client side is no longer a source of cycle swallowing. ARM is the most likely winner, in the medium term. Short term, M$ will still be able to hold back the tide. But that's only short term. As the core count goes up, but frequency levels off at best, Office and such will either have to freeze features or slow down. Office programs are inherently linear; only if M$ can figure a way to parallelize execution (not likely as doing so assumes the maths which don't exist) can they continue to add bells and whistles without irritating slowdown. Some, humble self included, have been predicting this for a few years; we've seen this movie before.

It's also no accident that the Erlang book is about writing servers. As I said in an earlier msg., one way around the situation is for some young buck to invent a problem to fit the solution. One possible "problem" is to redefine what we mean by Application Program: all programs are servers. Whether such a paradigm actually makes more useful programs for users is up in the air. It does provide a rationale for coders to continue writing client applications on the Intel architecture. Sorry, server applications for use by client machines. Short of a replacement for von Neumann, I suspect that this is the path of least resistance.


From: Robert Young (Oct 13 2009, at 06:15)

And here is an exhaustive precis of the problem:

http://ei.cs.vt.edu/~history/Parallel.html


From: derekdb (Oct 13 2009, at 12:52)

Trying to find the one true solution to building distributed systems is inherently flawed, just as the search for the one true language is flawed. Different problems have different constraints. The best solution for using multiple cores on a single host is not necessarily the best solution for building software that scales horizontally. Building solutions that scale horizontally to 10s of machines is fundamentally different than scaling to 1000s of machines.

That said, you are right to be poking around and asking questions. There is no strong consensus on how to build any of these yet. This space is evolving fast, and the move to cloud computing is providing experience working with larger clusters than ever before. At the other end, the move to multi-core processors is forcing every developer to have to think about how to parallelize their tasks.

Just like with selecting an appropriate programming language, you need to understand your goals and constraints first. What is sorely missing, is collective knowledge about how to map those goals and constraints to architectures. When is the erlang model a good choice? How does it compare to Clojure's STM? etc.

All of these approaches share some key common elements. They are trying to present a simplified model that makes it easier for developers to break their problem into smaller units and reason about those units. The side-effect free models provide the greatest clarity, but are the most alien to the average developer. Even in a side-effect free world, eventually you need to join your data, and the complexity of that is often underestimated. This quickly falls into the emergent complexity problem... Each component behaves in a logical and predictable way, but together they result in emergent behavior that is not logical or easily predictable.

At the end of the day, it is about building solutions that perform their intended function. Efficiency is always secondary. Ruby on Rails is a great example, where improvements to the development process vastly outweigh the performance cost... for most people. There are cases where it just isn't acceptable, and then you are forced to use more complicated tools. You need to understand your goals and constraints, and how that relates to your tools...

