Sunday, August 21, 2011

Haskell Arrays vs. Efficincy

If you are new to the Haskell computer language, you ought to read as many guides, tutorials, and cookbooks as you can. I did this myseful, but I unfortunately never really understood how to use arrays. Especially coming from a C/C++/Java background. In traditional languages, Arrays are used as a data structure with O(1) constant time reads/writes of elements.

But in Haskell arrays are NOT constant-time updatable, and the lazy-evaluated nature of Haskell is largely to blame for this "problem". But there is more to it than that. The implications of a lazy-evaluated language are often difficult to grasp, but in a nutshell, it allows you to express your program as a mathematical equation so the compiler can "prove" your equations are correct and efficient. The compiler can therefore (ususally) choose the most efficient low-level implementation of your algorithm. So, using "Data.Array.IO" to create arrays, and to do all read/updates in the IO monad is NOT necessarily going to produce the fastest executable program.

Using the IO monad basically forces execution to happen in a certain order. This means the compiler is unable to employ a vast range of optimizing strategies on IO computations. Using an unboxed array in the IO monad, you may gain some efficiency from constant-time updates, but you have likely lost efficiency because the compiler can no longer prove that all the steps leading up to each write operation were as efficient as they could have been.

For example, what if your code does this:

main = do
    a <- newArray (0,1024) 0 :: IOUArray Int IO Int
    writeArray a 0 999

something_else = do
    writeArray a 0 555

The "something_else" function overwrites the value written by the "main". But, it happened in the IO monad, so the compiler promises it will execute both writes, so it is obliged to not optimize away that first write from "main". Had your code been pure (i.e. not in the IO monad) the compiler may have been able to prove that the first array operation is always unnecessary. If the "main" function were some other function that is executed a million times, that would cause quite a performance hit. Of course, this is an oversimplified example, and your Haskell compiler may well be clever enough to optimize-out the needless array writes. But this problem becomes especially hard to catch as programs get larger and more complex.

So, keep these things in mind:

  • Arrays are simply mappings from integers to elements.

  • Arrays are similar to "Data.IntMap"

  • Only use arrays if you want constant-time reads, not writes.

  • For histograms, use "Data.Array.Diff" which has Array structures optimized for that purpose.

  • Use an "IntMap" for fast writes of arbitrary elements.

  • Use "Control.Monad.State.Lazy" with an "IntMap" or "DiffArray" as the state.

  • Try to structure your program so all the writes are done in one broad stroke, and then use "Data.IntMap" for writing. Then, when you need not make anymore updates, "freeze" your IntMap into an immutable array and let the rest of your program enjoy constant time lookups.

    import MySetup (lots_of_writes)
    import Data.Array.IArray
    import Data.IntMap
    import Control.Monad.State.Lazy

    lb = 0 -- lower bound
    ub = 1024 -- upper bound

    main = do
      -- First do all "write" operations:
      let my_intmap = execState (lots_of_writes lb ub) (Data.IntMap.empty)

      -- Then "freeze" it, make an array for lots of reads:
      let my_array = array (lb, ub) (Data.IntMap.elems my_intmap)

      -- Then do lots of reads:
      ixlist <- (readFile "./indecies" >>= return . read) :: IO [Int]
      mapM_ (\i -> putStrLn $ show (my_array ! i)) ixlist

No comments: