by Tom Ellis on 27th October 2012
(a rhetorical question)
Honza Pokorny set Haskell Reddit ablaze with his criticism of Haskell’s string-handling performance. It was discussed at length on Reddit, and CharlesStain wrote a response with a proposed solution.
The benchmark left a lot vague. Python’s str.capitalize upper-cases the first letter of the string but lower-cases the rest, whereas his C and Haskell versions upper-case the initial letter of each word and leave the rest unchanged; his C version forces lines to end after 1024 characters; and he does not specify a character encoding. Still, it’s worth pointing out that idiomatic Haskell can be astonishingly fast at this kind of task.
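To make the semantic difference concrete, here is a rough sketch of the two behaviours in Haskell (the helper names are mine, purely for illustration):

```haskell
import Data.Char (toLower, toUpper)

-- Python's str.capitalize: upper-case the first character,
-- lower-case everything else.
pythonCapitalize :: String -> String
pythonCapitalize []       = []
pythonCapitalize (c : cs) = toUpper c : map toLower cs

-- The benchmarked C and Haskell versions: upper-case the initial
-- letter and leave everything else untouched.
benchCapitalize :: String -> String
benchCapitalize []       = []
benchCapitalize (c : cs) = toUpper c : cs
```

So pythonCapitalize "hELLO" gives "Hello", while benchCapitalize "hELLO" gives "HELLO" — already a different task before any timing happens.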
I generated a 175MB file of random words using a Python program:
import random

letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def text():
    for i in xrange(175 * 1024 * 1024):
        if random.random() < 0.01:
            yield "\n"
        elif random.random() < 0.1:
            yield " "
        else:
            yield letters[int(random.random() * len(letters))]

print ''.join(text())
This is Honza’s C version:
#include <stdio.h>
#include <string.h>
#include <ctype.h>  /* needed for isalpha and toupper */

int main(void) {
    static const char filename[] = "file";
    FILE *file = fopen(filename, "r");
    if (file != NULL) {
        char line[1024];
        while (fgets(line, sizeof line, file) != NULL) {
            line[strcspn(line, "\n")] = '\0';
            int lengthOfLine = strlen(line);
            int word = 0;
            int i;
            for (i = 0; i < lengthOfLine; i++) {
                if (isalpha(line[i])) {
                    if (!word) {
                        line[i] = (char) toupper(line[i]);
                        word = 1;
                    }
                } else {
                    word = 0;
                }
            }
            printf("%s\n", line);
        }
        fclose(file);
    } else {
        perror(filename);
    }
    return 0;
}
On my machine the -O3 optimized version runs in 2.6 seconds. It iterates over line three times: once in strcspn, once in strlen, and once in the for loop. I tried making it iterate only once, but that wasn’t actually any faster.
Here is a simplification of CharlesStain’s version:
import Control.Monad
import qualified Data.ByteString as B
import Data.Word8

{- | Converts a Char to a Word8. Taken from MissingH. -}
c2w8 :: Char -> Word8
c2w8 = fromIntegral . fromEnum

capitalize :: B.ByteString -> B.ByteString
capitalize = B.tail . B.scanl (\a b -> if isSpace a then toUpper b else b) (c2w8 ' ')

main :: IO ()
main = (B.putStr <=< (return . capitalize) <=< B.readFile) "file"
On my machine the -O2 optimized version runs in 1.8 seconds. It is significantly faster than the C!
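The scanl trick is worth spelling out: B.scanl emits the accumulator at every position, seeded here with a space, so each output byte is determined by the byte before it, and B.tail drops the seed. Here is the same idea sketched with Data.ByteString.Char8 for readability (the original operates on Word8):

```haskell
import qualified Data.ByteString.Char8 as C
import Data.Char (isSpace, toUpper)

-- The accumulator `a` is the previous *output* character; since
-- upper-casing never turns a space into a non-space (or vice versa),
-- testing it is equivalent to testing the previous input character.
capitalize :: C.ByteString -> C.ByteString
capitalize = C.tail . C.scanl step ' '
  where step a b = if isSpace a then toUpper b else b
```

The space seed means the very first character of the file counts as word-initial, exactly as if it followed a space.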
The ByteString version reads the whole file into memory. Haskell can work through it 1024 bytes at a time too, but the only way I found to do that while maintaining performance is to use the hardcore mutability of foreign pointers:
import Data.Word8 (Word8, toUpper)
import Control.Monad (when)
import System.IO (Handle, openBinaryFile, IOMode(ReadMode), hGetBuf, hPutBuf, stdout)
import GHC.ForeignPtr (mallocPlainForeignPtrBytes, ForeignPtr)
import Foreign.ForeignPtr (withForeignPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)
import Foreign.Ptr (Ptr)

main :: IO ()
main = do
  ptr <- mallocPlainForeignPtrBytes 1024
  handle <- openBinaryFile "file" ReadMode
  untilDone (readAndPrint handle ptr) False

readAndPrint :: Handle -> ForeignPtr Word8 -> Bool -> IO (Bool, More)
readAndPrint h ptr s = withForeignPtr ptr $ \p -> do
  count <- hGetBuf h p 1024
  s' <- capitalize s count p
  hPutBuf stdout p count
  return $ if count < 1024 then (s', Done) else (s', More)

capitalizeLetter :: Word8 -> Bool -> (Maybe Word8, Bool)
capitalizeLetter c previouslyWithinWord =
  if isAlpha_ c
    then if not previouslyWithinWord
           then (Just $ toUpper c, True)
           else (Nothing, True)
    else (Nothing, False)

capitalize :: Bool -> Int -> Ptr Word8 -> IO Bool
capitalize = mapAccumL_ capitalizeLetter

data More = More | Done deriving Eq

untilDone :: (s -> IO (s, More)) -> s -> IO ()
untilDone f = again
  where again s = do
          (s', more) <- f s
          when (more == More) (again s')

mapAccumL_ :: (Word8 -> s -> (Maybe Word8, s)) -> s -> Int -> Ptr Word8 -> IO s
mapAccumL_ f s_ total p = go 0 s_
  where go i s =
          if i == total
            then return s
            else do
              old <- peekByteOff p i :: IO Word8
              let (maybe_new, next_s) = f old s
              case maybe_new of
                Nothing  -> return ()
                Just new -> pokeByteOff p i new
              go (i + 1) next_s

isAlpha_ :: Word8 -> Bool
isAlpha_ c = ((65 <= c) && (c <= 90)) || ((97 <= c) && (c <= 122))
With -O2 it takes 2.1s, uses barely any memory, and no inlining is required. This code looks longer and more complicated than the C, but that is an illusion: it is full of simple combinators and utility functions, and there is no reason they, or something like them, should not be part of a standard library. (Perhaps they already are! This is my first time using Ptr and ForeignPtr.) A special implementation of isAlpha_ is necessary for this to be fast, because the one provided by Data.Word8 is more complicated.
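On the “perhaps they already are” point: a pure analogue of mapAccumL_ does ship with base, as Data.List.mapAccumL (with the accumulator first in both the argument function and the result). A list-based sketch of the same state machine, just to show the shape:

```haskell
import Data.Char (isAlpha, toUpper)
import Data.List (mapAccumL)

-- Same state machine as capitalizeLetter: the Bool records whether
-- we are already inside a word.
step :: Bool -> Char -> (Bool, Char)
step inWord c
  | isAlpha c = (True, if inWord then c else toUpper c)
  | otherwise = (False, c)

capitalize :: String -> String
capitalize = snd . mapAccumL step False
```

Here capitalize "hello world" gives "Hello World"; the mutable mapAccumL_ is the same idea with writes skipped when a byte is unchanged.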
This version used to be as fast as CharlesStain’s version, until Verroq on Reddit pointed out a bug. Thanks Verroq!
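As an aside, untilDone is a completely generic state-threading loop; nothing about it is specific to buffers. Repeating its definition, here is a toy countdown (my own example, not from the benchmark):

```haskell
import Control.Monad (when)

data More = More | Done deriving Eq

-- Run an effectful step repeatedly, threading the state through,
-- until the step reports Done.
untilDone :: (s -> IO (s, More)) -> s -> IO ()
untilDone f = again
  where again s = do
          (s', more) <- f s
          when (more == More) (again s')

-- Toy usage: print 3, 2, 1 and stop.
countdown :: IO ()
countdown = untilDone step (3 :: Int)
  where step n = do
          print n
          return (n - 1, if n > 1 then More else Done)
```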
Why are these simple text processing programs so much faster than the C? I don’t know. For one thing, they do not all perform exactly the same task! The behaviour of C’s toupper depends on the locale, so perhaps that is making it slower than it would be if it behaved exactly like the Haskell versions.
On Reddit, Syzygies has produced a neat piece of Haskell which is even faster. Someone sufficiently enthusiastic could no doubt wring every last drop of performance out of C and make it faster than the Haskell. However, that’s not the point. The point is that writing simple, idiomatic Haskell is productive and a pleasure, and we should not, and do not, have to sacrifice performance for it.