Nottingham FP Lab Blog

Parser combinators are as expressive as possible

by Nils Anders Danielsson on October 28, 2009.
Tagged as: Lunches.

What is the expressive strength of parser combinators? Which languages can be expressed using them? The question depends on the definition of parser combinators used. In our last meeting I showed that a simple parser combinator library which I have implemented in Agda is as expressive as possible: it can be used to recognise exactly those languages (sets of bit strings) for which a decision procedure can be implemented in Agda.

Consider the following language of parser (or perhaps recogniser) combinators:

data P : Bool → Set where
  ∅   : P false
  ε   : P true
  tok : Bool → P false
  _∣_ : ∀ {n₁ n₂} → P n₁ →        P n₂  → P (n₁ ∨ n₂)
  _·_ : ∀ {n₁ n₂} → P n₁ → ∞? n₁ (P n₂) → P (n₁ ∧ n₂)

The parsers are indexed on their nullability; the index is true iff the parser accepts the empty string. The first four combinators are unsurprising: failure, the recogniser for the empty string, the recogniser for a given token, and symmetric choice. The last combinator, sequencing, is more interesting: the second argument is allowed to be coinductive if the first argument does not accept the empty string:

∞? : Bool → Set → Set
∞? true  A =   A
∞? false A = ∞ A

(Read ∞ A as a thunk, i.e. a suspended computation of type A.) This ensures that grammars can be infinite, but only in a controlled way. If you check out the code you will find a formal semantics for these combinators, and a (verified and total) procedure which decides whether a given string is a member of a given language defined using the combinators. This shows that every language definable using the combinators is decidable. (For more details, see Mixing Induction and Coinduction.)

What about the other direction? Given a decision procedure of type List Bool → Bool a corresponding grammar can be produced. The idea is to build an infinite tree with one “node” for every possible input string, and to make this “node” accepting iff the decision procedure accepts the given string:

accept-if-true : ∀ n → P n
accept-if-true true  = ε
accept-if-true false = ∅

grammar : (f : List Bool → Bool) → P (f [])
grammar f = tok true  · ♯ grammar (f ∘ _∷_ true )
          ∣ tok false · ♯ grammar (f ∘ _∷_ false)
          ∣ accept-if-true (f [])

(Read ♯_ as “delay”, i.e. as the constructor of a thunk.) For a formal proof of the correctness of the grammar, see the Agda code. I have also generalised the proof to proper parser combinators, for which the statement is that, for every R, every function of type List Bool → List R can be realised using parser combinators (if set equality is used for the list of results); if the bind combinator is allowed, then the result easily generalises to arbitrary token types.

The language defined by grammar f above can be parsed using deterministic recursive descent with one token of lookahead; grammar f could be viewed as an “infinite LL(1)” grammar. I suspect that this grammar scheme can be implemented using many parser combinator libraries. I also suspect that the result above is already known. It has been known for a long time that a potentially infinite grammar can represent any countable language (I found this mentioned in Marvin Solomon’s PhD thesis from 1977), and given that fact the result above is only a small step away. Pointers to the literature are welcome.