
Separating lexing and parsing stages in Parsec

August 30, 2012 - Tagged as: haskell, en.

I really love Parsec. After using it for several months now, I can't imagine parsing with anything other than Parsec.

After I started working on more complex grammars, I thought it might be a good idea to separate the lexing stage, since that could lead to simpler parser code. After some experimenting, a SO question, and reading parts of Parsec's source, I finally managed to separate lexing and parsing. Now I'll explain how to do that.

In the lexing stage, besides writing ordinary token parsers, we also need to handle token positions. Each token moves the cursor, and saving this information is needed for error reporting (and maybe some other reasons). In our case, the simplest tokens are characters, and since we're using Parsec's built-in Char tokens, we don't need to handle the characters' positions. But we do need to save each token's position, because in the parsing stage we won't be using Char tokens; we will be using our custom tokens.

You can see the full source here.

data Token = Ide String
           | LBrack
           | RBrack
           | LBrace
           | RBrace
           | Keyword String
    deriving (Show, Eq)

Token types should be instances of Eq, to be able to test for equality while parsing, and of Show, to be able to print them in error situations (actually, you can use any functions for equality testing and printing, but I find this way easier).

type TokenPos = (Token, SourcePos)

So we will be using (Token, SourcePos) pairs to represent tokens together with their positions in the source. Now the parsers are simple:

ide :: Parser TokenPos
ide = do
    pos <- getPosition
    fc  <- oneOf firstChar
    r   <- many (oneOf rest)   -- zero or more continuation characters
    spaces
    return (Ide (fc : r), pos)
  where firstChar = ['A'..'Z'] ++ ['a'..'z'] ++ "_"
        rest      = firstChar ++ ['0'..'9']

This is a simple identifier parser (or lexer). Note the pos <- getPosition part.
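For example, running it on its own in GHCi (the printed position shape is approximate):

-- ghci> parse ide "test" "foo_bar1  "
-- Right (Ide "foo_bar1","test" (line 1, column 1))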

After writing more lexers like this, we need a tokenize function to generate the token stream:

tokenize :: SourceName -> String -> Either ParseError [TokenPos]
tokenize = runParser tokens ()
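The tokens parser used here just runs the individual lexers until end of input. Here is a minimal sketch of what it might look like; the charTok helper and the single-character lexer names are my own, I skip Keyword handling, and the real definitions are in the linked source:

-- A helper for single-character tokens, built the same way as ide.
-- Note that Parsec also exports a function named `tokens`, so the real
-- module needs to hide it (or import Parsec qualified).
charTok :: Char -> Token -> Parser TokenPos
charTok c t = do
    pos <- getPosition
    _   <- char c
    spaces
    return (t, pos)

lbrack, rbrack, lbrace, rbrace :: Parser TokenPos
lbrack = charTok '[' LBrack
rbrack = charTok ']' RBrack
lbrace = charTok '{' LBrace
rbrace = charTok '}' RBrace

tokens :: Parser [TokenPos]
tokens = spaces *> many (choice [ide, lbrack, rbrack, lbrace, rbrace]) <* eof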

In the parsing stage, we have several problems. One is that we can no longer use Parsec's parseTest function, which I almost always use for testing, because we now need to pass the string to the lexer first and then pass its output to the parser. So I wrote this:

import Text.Parsec as P
...
parseTest  :: Show a => Parsec [TokenPos] () a -> String -> IO ()
parseTest p s =
    case tokenize "test" s of
        Left e    -> putStrLn $ show e
        Right ts' -> P.parseTest p ts'
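For example, using the tok combinator defined below (output shape is approximate):

-- ghci> parseTest (tok LBrack) "[ foo"
-- LBrack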

satisfy takes a predicate on tokens and returns a token parser using Parsec's tokenPrim function, which takes three functions as parameters: one for printing the token (to be used in error reporting), one for updating the current position, and one for applying the predicate and returning the result. I adapted the position-advancing function from Parsec's string parsers:

-- In the parsing stage, our parsers run over token streams:
-- type Parser = Parsec [TokenPos] ()

advance :: SourcePos -> t -> [TokenPos] -> SourcePos
advance _ _ ((_, pos) : _) = pos
advance pos _ [] = pos

satisfy :: (TokenPos -> Bool) -> Parser Token
satisfy f = tokenPrim show
                      advance
                      (\c -> if f c then Just (fst c) else Nothing)

The last function, tok, takes a token and returns a parser for exactly that token:

tok :: Token -> Parser Token
tok t = (satisfy $ (== t) . fst) <?> show t

The <?> show t part is required for error reporting. See my SO question for details.
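To see these combinators in action, here is a hypothetical parser for a brace-delimited identifier such as "{ foo }"; both anyIde and braced are my own names, not from the post:

-- Matches any identifier token; satisfy lets us match on the token's
-- shape rather than on an exact value like tok does.
anyIde :: Parser Token
anyIde = satisfy isIde <?> "identifier"
  where isIde (Ide _, _) = True
        isIde _          = False

-- A brace-delimited block containing a single identifier.
braced :: Parser Token
braced = tok LBrace *> anyIde <* tok RBrace

-- ghci> parseTest braced "{ foo }"
-- Ide "foo"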

Now, when using Parsec, there's generally no need to separate the lexing stage, because as you can see above, it just makes the code more complex without adding any flexibility. Instead of writing tok $ Ide "some-identifier", you could always write ide "some-identifier" with an ide parser.

Still, I think understanding how to do this helps for at least two reasons. First, you can now work on any stream with Parsec, not just character streams via its Char tokens. Second, writing a lexer can still help in some situations, like parsing indentation-based grammars. In that case, you can generate indent/dedent tokens in the lexing stage and make the parser code cleaner, because it will be less context-dependent (I never tried that with Parsec, though).