osa1

An idea to handle left-recursion in Parsec

March 7, 2014 - Tagged as: en, haskell, parsing.

I recently realized that it may be possible to handle left-recursion in Parsec style parser combinator libraries. I quickly wrote a simple prototype implementation that demonstrates the idea.

The idea is to keep track of parser functions that are called without consuming any tokens from input stream. You should be able to run failure procedures when same parser function is encountered more than one time without consuming any tokens. When a token is consumed, state that is used to keep track of parser functions should be reset.

One assumption here is that I’m assuming parser functions do not alter any states. Otherwise when you come across a same parser, you can have different state and parser may behave differently.

Implementing the idea is easy even without having any extra support from Parsec. Here’s a demonstration on my favorite ambiguous grammar:

-- Exp ::= Int
--      |  Exp `+` Exp

data Exp = Int Int
         | Add Exp Exp
         deriving (Show)

Parsers for non-terminals are easy:

plus = char '+' >> spaces
int  = fmap (Int . read) $ many1 digit <* spaces

I’m using a set to keep track of already visited parsers:

type MarksSeen = S.Set Int

type RecParser s m a = ParsecT s MarksSeen m a

Auxiliary functions to alter the state that keeps track of visited parsers:

putMark i = do
    is <- getState
    if S.member i is
      then fail "recursion"
      else putState $ S.insert i is

resetMarks = putState S.empty

Now the interesting part, add parser marks itself as first thing to do, and calls exp parser. Since exp parser is entry point, this means an indirect recursive call. When same putMark call is made, Parsec runs failure actions instead of going into infinte loop:

add = do
    putMark 1
    e1 <- exp
    resetMarks
    plus
    e2 <- exp
    spaces
    return $ Add e1 e2

exp = choice [try add, int]

resetMarks call is also important, exp has to consume some tokens, so after parsing e1, I’m calling resetMarks.

Here’s an example call of this parser:

ghci> runParser exp S.empty "" "1 + 2 + 3 + 4"
Right (Add (Int 1) (Add (Int 2) (Add (Int 3) (Int 4))))

You can observe that parser gets into an infinite loop when marks are removed. Here’s an example demonstrating the error message:

rec = putMark 0 >> rec >> resetMarks

ghci> runParser rec S.empty "" ""
Left (line 1, column 1):
recursion

One problem with this approach is that it requires more typing, and you should be careful too. Marks can be placed using TemplateHaskell to ensure unique numbers are given to each putMark call. As a second improvement, I think with some modifications on Parsec we can make Parsec to reset marks when a token is consumed(using consumed-ok continuation of ParsecT).

You can see the complete program here.

Removing left-recursions in your grammar may not be a huge problem – except when you’re working on functional languages with ML-like syntax. Then you’re out of luck because being functional means you’re Exp non-terminal contains several dozen of productions and function applications is a part of that too, and it’s left-recursive:

Exp ::= ...
     |  Exp Exp_1 ... Exp_N [left-associative]
     ... a hundred more productions ...

Whenever I need to write a parser for a grammar like this, I’m thinking for an easier way to parse it. I still couldn’t come up with a solution. Idea I just explained does not solve it, because it parses right-associatively. There is one workaround, but I’m not sure if that results with a parser for same grammar:

app = do
    putMark 2
    fn <- exp
    resetMarks

    putMark 2
    as <- many1 exp
    resetMarks

    return $ foldl App fn as


ghci> runParser pgm S.empty "" "1 2 3 4 + 5 + 6 7"
Right (Add (App (App (App (Int 1) (Int 2)) (Int 3)) (Int 4)) (Add (Int 5) (App (Int 6) (Int 7))))

ghci> runParser pgm S.empty "" "1 + 2 + 3 4"
Right (Add (Int 1) (Add (Int 2) (App (Int 3) (Int 4))))

(Show comments)