Haskell hackery
Parallel, lazy, map
Every functional programmer uses map. It is probably one of the most commonly used functions. Wouldn't it be nice if we could process a few elements of a list at a time? First, here is a motivation--we want to map a computationally expensive function over a list. Say we want to factor a bunch of numbers. Here is a nice and slow function that factors a number:
factors n = let candidates = [2..floor (sqrt (fromInteger n))]
in catMaybes $ map (\x -> if n `mod` x == 0
then Just (x, n `div` x)
else Nothing) candidates
Let's calculate how many pairs of factors ten really large numbers have:
bigNums = [2000000000000..]
answer = map (length . factors) (take 10 bigNums)
main = print answer
You need to import Data.Maybe to compile this. It's not incredibly fast:
$ time ./a.out
[90,7,7,0,5,63,7,11,31,3]
real 0m5.634s
user 0m5.600s
sys 0m0.033s
The computation for each number in
take 10 bigNumsis independent, so wouldn't it make sense to do them, say, two or four at a time, if we are on a dual or quad core machine? Haskell has a pretty straightforward mechanism of parallelism with which we can construct a parallel variant of
map:
$ cat primes-test.hs
module Main where
import Data.Maybe
import Control.Parallel.Strategies
import Control.Parallel
factors n = let candidates = [2..floor (sqrt (fromInteger n))]
in catMaybes $ map (\x ->
if n `mod` x == 0
then Just (x, n `div` x)
else Nothing) candidates
bigNums = [2000000000000..]
answer = (parMap rwhnf) (length . factors) (take 10 bigNums)
main = print answer
$ ghc primes-test --make -threaded
[1 of 1] Compiling Main ( primes-test.hs, primes-test.o )
Linking primes-test ...
Let's compare the runtimes with and without threads enabled (they must be turned on with the command line parameter "-threaded"):
$ time ./primes-test
[90,7,7,0,5,63,7,11,31,3]
real 0m5.747s
user 0m5.706s
sys 0m0.033s
$ time ./primes-test +RTS -N2
[90,7,7,0,5,63,7,11,31,3]
real 0m3.308s
user 0m5.840s
sys 0m0.100s
That is almost a 50% speed-up, but enabling more threads than I have cores results in a severe performance penalty. There is a serious problem with this parallel map: it is not lazy. It will evaluate an entire list, whether the result is needed or not. For example, try this code instead of the above. They should be equivalent:
answer = take 10 $ (parMap rwhnf) (length . factors) bigNums
Be careful running this--it may eat all your memory. The code will not never finish, as it tries to find every factor of each number in the infinite list
[2000000000000..].
We can write another version of parallel map ourselves using
parto evaluate two list elements at a time. The function
paris
par :: a -> b -> b, and it is run with two arguments. It puts the first argument in a queue to be evaluated if there are free threads available. The second argument is returned. This is a win where the second argument uses the first argument, as in
par y [x,y], where
xand
yare effectively evaluated at the same time (if the function is run in a multithreaded environment) and
[x,y]is returned.
First, this would be the naive algorithm to map over a list, two elements at a time:
map2 :: (a -> b) -> [a] -> [b]
map2 _ [] = []
map2 f [x] = [f x]
map2 f (x:y:ys) = f x : f y : map2 f ys
To take advantage of
par, we only need to modify it slightly:
map2 :: (a -> b) -> [a] -> [b]
map2 _ [] = []
map2 f [x] = [f x]
map2 f (x:y:ys) = fy `par` f x : fy : map2 f ys
where fy = f y
I define
fyto ensure that
f yis shared--its value is only calculated once, while writing
f yin multiple places might be shared, or it might not--this much more likely to be respected by the compiler.
The following version looks messier, but it will take advantage of up to four threads:
mapP :: (a -> b) -> [a] -> [b]
mapP _ [] = []
mapP f (w:[]) = [f w]
mapP f (w:x:[]) = fx `par` [f w,fx]
where fx = f x
mapP f (w:x:y:[]) = fx `par` fy `par` [f w,fx,fy]
where fx = f x
fy = f x
mapP f (w:x:y:z:zs) = fx `par` fy `par` fz `par` (f w:fx:fy:fz : (mapP f zs))
where fx = f x
fy = f y
fz = f z
mapPcan be used as a drop-in replacement for
map, though this would only be advisible when a slow computation is being performed. It is fairly easy to rewrite any recursive algorithm in the way we have rewritten
map.
Somebody pointed out to me that
parBuffercan also be used to structure (mostly) lazy parallel computations. Its type signature is:
parBuffer :: Int -> Strategy a -> [a] -> [a]
It sparks the evaluation of n elements of a list, where you decide n, and also the evaluation strategy. (This decides how the elements will be evaluated. For example, a list being evaluated is different than its elements being evaluated, and you may need to write a strategy if you want a data structure to be evaluated in a certain way. I liked the explanation of strategies here. I've chosen the strategy
rwhnf.) One last (and easy) definition of
mapPcould be:
mapP f xs = parBuffer 4 rwhnf $ map f xs
though this is not faster than the above version, and may be slightly slower, in some cases. (If you look at parBuffer's source in GHC, there is a lot more recursion than in the method defined above.) However, the built in method is a big win if the number of sparked computations needs to be set at runtime, rather than compilation.
E-mail me at dzwell@[this domain].