March 7, 2014 - Tagged as: en, haskell, parsing.
I recently realized that it may be possible to handle left-recursion in Parsec style parser combinator libraries. I quickly wrote a simple prototype implementation that demonstrates the idea.
The idea is to keep track of parser functions that are called without consuming any tokens from input stream. You should be able to run failure procedures when same parser function is encountered more than one time without consuming any tokens. When a token is consumed, state that is used to keep track of parser functions should be reset.
One assumption here is that I’m assuming parser functions do not alter any states. Otherwise when you come across a same parser, you can have different state and parser may behave differently.
Implementing the idea is easy even without having any extra support from Parsec. Here’s a demonstration on my favorite ambiguous grammar:
-- Exp ::= Int
-- | Exp `+` Exp
data Exp = Int Int
| Add Exp Exp
deriving (Show)
Parsers for non-terminals are easy:
= char '+' >> spaces
plus = fmap (Int . read) $ many1 digit <* spaces int
I’m using a set to keep track of already visited parsers:
type MarksSeen = S.Set Int
type RecParser s m a = ParsecT s MarksSeen m a
Auxiliary functions to alter the state that keeps track of visited parsers:
putMark i = do
is <- getState
if S.member i is
then fail "recursion"
else putState $ S.insert i is
resetMarks = putState S.empty
Now the interesting part, add
parser marks itself as first thing to do, and calls exp
parser. Since exp
parser is entry point, this means an indirect recursive call. When same putMark
call is made, Parsec runs failure actions instead of going into infinte loop:
= do
add 1
putMark <- exp
e1
resetMarks
plus<- exp
e2
spacesreturn $ Add e1 e2
exp = choice [try add, int]
resetMarks
call is also important, exp
has to consume some tokens, so after parsing e1
, I’m calling resetMarks
.
Here’s an example call of this parser:
ghci> runParser exp S.empty "" "1 + 2 + 3 + 4"
Right (Add (Int 1) (Add (Int 2) (Add (Int 3) (Int 4))))
You can observe that parser gets into an infinite loop when marks are removed. Here’s an example demonstrating the error message:
= putMark 0 >> rec >> resetMarks
rec
> runParser rec S.empty "" ""
ghciLeft (line 1, column 1):
recursion
One problem with this approach is that it requires more typing, and you should be careful too. Marks can be placed using TemplateHaskell to ensure unique numbers are given to each putMark
call. As a second improvement, I think with some modifications on Parsec we can make Parsec to reset marks when a token is consumed(using consumed-ok continuation of ParsecT
).
You can see the complete program here.
Removing left-recursions in your grammar may not be a huge problem – except when you’re working on functional languages with ML-like syntax. Then you’re out of luck because being functional means you’re Exp
non-terminal contains several dozen of productions and function applications is a part of that too, and it’s left-recursive:
Exp ::= ...
| Exp Exp_1 ... Exp_N [left-associative]
... a hundred more productions ...
Whenever I need to write a parser for a grammar like this, I’m thinking for an easier way to parse it. I still couldn’t come up with a solution. Idea I just explained does not solve it, because it parses right-associatively. There is one workaround, but I’m not sure if that results with a parser for same grammar:
= do
app 2
putMark <- exp
fn
resetMarks
2
putMark <- many1 exp
as
resetMarks
return $ foldl App fn as
> runParser pgm S.empty "" "1 2 3 4 + 5 + 6 7"
ghciRight (Add (App (App (App (Int 1) (Int 2)) (Int 3)) (Int 4)) (Add (Int 5) (App (Int 6) (Int 7))))
> runParser pgm S.empty "" "1 + 2 + 3 4"
ghciRight (Add (Int 1) (Add (Int 2) (App (Int 3) (Int 4))))