home page of dan zwell

Haskell hackery

Because there aren't enough newbie haskell tutorials. I release all code samples on this page to the public domain.

Parallel, lazy, map

Every functional programmer uses map. It is probably one of the most commonly used functions. Wouldn't it be nice if we could process a few elements of a list at a time? First, here is a motivation--we want to map a computationally expensive function over a list. Say we want to factor a bunch of numbers. Here is a nice and slow function that factors a number:

factors n = let candidates = [2..floor (sqrt (fromInteger n))] in catMaybes $ map (\x -> if n `mod` x == 0 then Just (x, n `div` x) else Nothing) candidates

Let's calculate how many pairs of factors ten really large numbers have:

bigNums = [2000000000000..] answer = map (length . factors) (take 10 bigNums) main = print answer

You need to import Data.Maybe to compile this. It's not incredibly fast:

$ time ./a.out [90,7,7,0,5,63,7,11,31,3] real 0m5.634s user 0m5.600s sys 0m0.033s

The computation for each number in

take 10 bigNums
is independent, so wouldn't it make sense to do them, say, two or four at a time, if we are on a dual or quad core machine? Haskell has a pretty straightforward mechanism of parallelism with which we can construct a parallel variant of

$ cat primes-test.hs module Main where import Data.Maybe import Control.Parallel.Strategies import Control.Parallel factors n = let candidates = [2..floor (sqrt (fromInteger n))] in catMaybes $ map (\x -> if n `mod` x == 0 then Just (x, n `div` x) else Nothing) candidates bigNums = [2000000000000..] answer = (parMap rwhnf) (length . factors) (take 10 bigNums) main = print answer $ ghc primes-test --make -threaded [1 of 1] Compiling Main ( primes-test.hs, primes-test.o ) Linking primes-test ...

Let's compare the runtimes with and without threads enabled (they must be turned on with the command line parameter "-threaded"):

$ time ./primes-test [90,7,7,0,5,63,7,11,31,3] real 0m5.747s user 0m5.706s sys 0m0.033s $ time ./primes-test +RTS -N2 [90,7,7,0,5,63,7,11,31,3] real 0m3.308s user 0m5.840s sys 0m0.100s

That is almost a 50% speed-up, but enabling more threads than I have cores results in a severe performance penalty. There is a serious problem with this parallel map: it is not lazy. It will evaluate an entire list, whether the result is needed or not. For example, try this code instead of the above. They should be equivalent:

answer = take 10 $ (parMap rwhnf) (length . factors) bigNums

Be careful running this--it may eat all your memory. The code will not never finish, as it tries to find every factor of each number in the infinite list


We can write another version of parallel map ourselves using

to evaluate two list elements at a time. The function
par :: a -> b -> b
, and it is run with two arguments. It puts the first argument in a queue to be evaluated if there are free threads available. The second argument is returned. This is a win where the second argument uses the first argument, as in
par y [x,y]
, where
are effectively evaluated at the same time (if the function is run in a multithreaded environment) and
is returned.

First, this would be the naive algorithm to map over a list, two elements at a time:

map2 :: (a -> b) -> [a] -> [b] map2 _ [] = [] map2 f [x] = [f x] map2 f (x:y:ys) = f x : f y : map2 f ys

To take advantage of

, we only need to modify it slightly:

map2 :: (a -> b) -> [a] -> [b] map2 _ [] = [] map2 f [x] = [f x] map2 f (x:y:ys) = fy `par` f x : fy : map2 f ys where fy = f y

I define

to ensure that
f y
is shared--its value is only calculated once, while writing
f y
in multiple places might be shared, or it might not--this much more likely to be respected by the compiler.

The following version looks messier, but it will take advantage of up to four threads:

mapP :: (a -> b) -> [a] -> [b] mapP _ [] = [] mapP f (w:[]) = [f w] mapP f (w:x:[]) = fx `par` [f w,fx] where fx = f x mapP f (w:x:y:[]) = fx `par` fy `par` [f w,fx,fy] where fx = f x fy = f x mapP f (w:x:y:z:zs) = fx `par` fy `par` fz `par` (f w:fx:fy:fz : (mapP f zs)) where fx = f x fy = f y fz = f z

can be used as a drop-in replacement for
, though this would only be advisible when a slow computation is being performed. It is fairly easy to rewrite any recursive algorithm in the way we have rewritten

Somebody pointed out to me that

can also be used to structure (mostly) lazy parallel computations. Its type signature is:
parBuffer :: Int -> Strategy a -> [a] -> [a]

It sparks the evaluation of n elements of a list, where you decide n, and also the evaluation strategy. (This decides how the elements will be evaluated. For example, a list being evaluated is different than its elements being evaluated, and you may need to write a strategy if you want a data structure to be evaluated in a certain way. I liked the explanation of strategies here. I've chosen the strategy
.) One last (and easy) definition of
could be:

mapP f xs = parBuffer 4 rwhnf $ map f xs

though this is not faster than the above version, and may be slightly slower, in some cases. (If you look at parBuffer's source in GHC, there is a lot more recursion than in the method defined above.) However, the built in method is a big win if the number of sparked computations needs to be set at runtime, rather than compilation.

E-mail me at dzwell@[this domain].