Nottingham FP Lab Blog

Folding Statistics

by Tom Nielsen on July 17, 2009.
Tagged as: Lunches.

Today, I talked about a statistics library I am developing for Haskell that has two advantages over existing libraries on hackage: single-pass statistical calculations with O(1) memory requirements are composable, and they can be applied to any collection type for which we can implement a fold.

A statistic is a usually a real valued function defined for a collection of observations, but lets generalise a bit and make it polymorphic in the type of the elements of the collection and the result type. We could represent a statistic (e.g. mean, standard deviation) as a function [a] -> b, but that would have two disadvantages: we are stuck with lists, and if we want to define new statistics (for instance, the coefficient of variation, which is defined as the ratio of the standard deviation to the mean) from old, we will traverse the collection at least twice – at least once for each statistic.

The trick to solving this is to use what Max Rabkin calls Beautiful Folding, which is apparently similar to a banana split. We define

data Fold b c = forall a. Fold (a -> b -> a) a (a -> c)

and define a “both” combinator with type Fold b c -> Fold b d -> Fold b (c,d). We can now write the functor instance for Fold:

fmap f (Fold op init k) = Fold op init (f . k)

and applicative

return x = Fold (\_ _ -> ()) () $ const x
f <*> g = uncurry ($) `fmap` f `both` g

as suggested by Twan van Laarhoven. We can now write simple definitions

sumF = Fold (+) 0 id
lengthF = realToFrac `fmap` Fold (\n _ -> n+1) 0 id

meanF = (/) <*> sumF <*> lengthF

Let’s make a small change to the definition of Fold

data Fold b c = forall a. Fold (a -> b -> a) a (a -> c) (a->a->a)

The last field is a function that tells me how to combine two folds. I have to update my definition of functor and applicative instances, which is trivial. The primitive statistics aren’t much more complicated:

sumF = Fold (+) 0 id (+)
lengthF = realToFrac <$> Fold (\n _ -> n+1) 0 id (+)

But now, if I know how to split my input collection type into chunks, I get parallel statistics, almost for free. I can split my observations into N chunks, fold each chunk on a seperate processor, and then combine the folds before running the continuation.