I really love Parsec. After using it several months now, I can’t think of parsing anything other than Parsec.
After I started working on more complex grammars, I thought maybe it could be good idea to separate lexing stage, since it may lead parser with simpler code. After some experimenting, a SO question, and reading some part of Parsec’s source, I finally managed to separate lexing and parsing. Now I’ll explain how to do that.
In lexing stage, other than writing ordinary token parsers, we also need to handle token positions. Each token moves the cursor, and saving this is needed for error reporing(and maybe some other reasons). In our case, simplest tokens are characters, and since we’re using Parsec’s built-in
Char token, we don’t need to handle characters’ positions. But we need to save each token’s positions because we won’t be using Char tokens in parsing stage, we will be using our custom tokens.
You can see the full source here.
data Token = Ide String | LBrack | RBrack | LBrace | RBrace | Keyword String deriving (Show, Eq)
Token types should be instances of
Eq to be able to test for equality while parsing, and
Show to be able to print in error situations(actually you can use any function for testing for equality and printing, but I find this way easier).
type TokenPos = (Token, SourcePos)
So we will be using
(Token, SourcePos) pairs for tokens with positions of them in source. Now parsers are simple:
ide :: Parser TokenPos ide = do pos <- getPosition fc <- oneOf firstChar r <- optionMaybe (many $ oneOf rest) spaces return $ flip (,) pos $ case r of Nothing -> Ide [fc] Just s -> Ide $ [fc] ++ s where firstChar = ['A'..'Z'] ++ ['a'..'z'] ++ "_" rest = firstChar ++ ['0'..'9']
This is a simple identifier parser(or lexer). Note the
pos <- getPosition part.
After more lexers like this, we need a
tokenize function to generate token stream:
tokenize :: SourceName -> String -> Either ParseError [TokenPos] tokenize = runParser tokens ()
In parsing stage, we have several problems. One is that now we can’t use Parsec’s
parseTest function, which I almost always use for testing purposes. Because now we need to pass the string to lexer, and then pass it’s output to parser. So I wrote this:
import Text.Parsec as P ... parseTest :: Show a => Parsec [TokenPos] () a -> String -> IO () parseTest p s = case tokenize "test" s of Left e -> putStrLn $ show e Right ts' -> P.parseTest p ts'
satisfy gets a predicate on token, and return a token parser using Parsec’s
tokenPrim function, which takes 3 functions as parameters, one for printing the token(to be used in error reporting), one for updating the current position, and one for returning the result after calling predicate. I adapted this function from Parsec’s string parsers:
advance :: SourcePos -> t -> [TokenPos] -> SourcePos advance _ _ ((_, pos) : _) = pos advance pos _  = pos satisfy :: (TokenPos -> Bool) -> Parser Token satisfy f = tokenPrim show advance (\c -> if f c then Just (fst c) else Nothing)
tok, takes a token and returns a token parser:
tok :: Token -> Parser Token tok t = (satisfy $ (== t) . fst) <?> show t
.. <?> show t part is required for error reporting. See my SO question for details.
Now, when using Parsec, generally there’s no need to separate lexing stage. Because as you can see above, it just makes code more complex, and adds no more flexibility. Instead of writing
tok $ Ide "some-identifier", you can always write
ide "some-identifier" with a
Still, I think understanding how to make this helps for at least two reasons. First, now you can work on any streams, not just character streams using Parsec’s
Char token. And second, writing lexers can still help in some situations, like parsing indentation-based grammars. In that case, you can generate indent-dedent tokens in lexing stage, and make parser code more clean because it will be less context-dependent (I never tried that with Parsec, though) .