The Scanner Business

Question: Why doesn't this compile?

    rule<> r = /*...*/;
    parse("hello world", r, space_p); // BAD [attempts phrase level parsing]

But if I remove the skip-parser, everything goes back to normal again:

    rule<> r = *anychar_p;
    parse("hello world", r); // OK [character level parsing]

Sometimes you'll want to pass in a rule to one of the functions parse functions that Spirit provides. The problem is that the rule is a template class that is parameterized by the scanner type. This is rather awkward but unavoidable: the rule is tied to a scanner. What's not obvious is that this scanner must be compatible with the scanner that is ultimately passed to the rule's parse member function. Otherwise, the compiler will complain.

Why does the first call to parse not compile? Because of scanner incompatibility. Behind the scenes, the free parse function creates a scanner from the iterators passed in. In the first call to parse, the scanner created is a plain vanilla scanner<>. This is compatible with the default scanner type of rule<> [see default template parameters of the rule]. The second call creates a scanner of type phrase_scanner_t. Thus, in order for the second call to succeed, the rule must be parameterized as rule<phrase_scanner_t>:

    rule<phrase_scanner_t> r = *anychar_p;
    parse("hello world", r, space_p);       //  OK [phrase level parsing]

Take note however that phrase_scanner_t is compatible only when you are using char const* iterators and space_p as the skip parser. Other than that, you'll have to find the right type of scanner. This is tedious to do correctly. In light of this issue, it is best to avoid rules as arguments to the parse functions. Keep in mind that this happens only with rules. The rule is the only parser that has to be tied to a particular scanner type. For instance:

    parse("hello world", *anychar_p);           //  OK  [character level parsing]
    parse("hello world", *anychar_p, space_p);  //  OK  [phrase level parsing
Multiple Scanner Support

As of v1.8.0, rules can use one or more scanner types. There are cases, for instance, where we need a rule that can work on the phrase and character levels. Rule/scanner mismatch has been a source of confusion and is the no. 1 FAQ. To address this issue, we now have multiple scanner support.

See the techniques section for an example of a grammar using a multiple scanner enabled rule, lexeme_scanner and as_lower_scanner.

Eliminating Left Recursion

Question: I ported a grammar from YACC. It's "kinda" working - the parser itself compiles with no errors. But when I try to parse, it gives me an "invalid page fault". I tracked down the problem to this grammar snippet:

    or_expr = xor_expr | (or_expr >> VBAR >> xor_expr);

What you should do is to eliminate direct and indirect left-recursion. This causes the invalid page fault because the program enters an infinite loop. The code above is good for bottom up parsers such as YACC but not for LL parsers such as Spirit.

This is similar to a rule in Hartmut Kaiser's C parser (this should be available for download from Spirit's site as soon as you read this).

    = exclusive_or_expression
    | inclusive_or_expression >> OR >> exclusive_or_expression

Transforming left recursion to right recursion, we have:

    = exclusive_or_expression >> inclusive_or_expression_helper

    = OR >> exclusive_or_expression >> inclusive_or_expression_helper
    | epsilon_p

I'd go further. Since:

    r = a | epsilon_p;

is equivalent to:

    r = !a;

we can simplify inclusive_or_expression_helper thus:

    = !(OR >> exclusive_or_expression >> inclusive_or_expression_helper)

Now, since:

    r = !(a >> r);

is equivalent to:

    r = *a;

we have:

    = *(OR >> exclusive_or_expression)

Now simplifying inclusive_or_expression fully, we have:

    = exclusive_or_expression >> *(OR >> exclusive_or_expression)

Reminds me of the calculators. So in short:

    a = b | a >> op >> b;

in pseudo-YACC is:

    a = b >> *(op >> b);

in Spirit. What could be simpler? Look Ma, no recursion, just iteration.

The lexeme_d directive and rules

Question: Does lexeme_d not support expressions which include rules? In the example below, the definition of atomicRule compiles,

    rule<phrase_scanner_t> atomicRule
        = lexeme_d[(alpha_p | '_') >> *(alnum_p | '.' | '-' | '_')];

but if I move alnum_p | '.' | '-' | '_' into its own rule, the compiler complains about conversion from const scanner<...> to const phrase_scaner_t&.

    rule<phrase_scanner_t> ch 
        = alnum_p | '.' | '-' | '_';

    rule<phrase_scanner_t> compositeRule
        = lexeme_d[(alpha_p | '_') >> *(ch)]; // <- error source

You might get the impression that the lexeme_d directive and rules do not mix. Actually, this problem is related to the first FAQ entry: The Scanner Business. More precisely, the lexeme_d directive and rules with incompatible scanner types do not mix. This problem is more subtle. What's causing the scanner incompatibility is the directive itself. The lexeme_d directive transforms the scanner it receives into something that disables the skip parser. This non-skipping scanner, unfortunately, is incompatible with the original scanner before transformation took place.

The simplest solution is not to use rules in the lexeme_d. Instead, you can definitely apply lexeme_d to subrules and grammars if you really need more complex parsers inside the lexeme_d. If you really must use a rule, you need to know the exact scanner used by the directive. The lexeme_scanner metafunction is your friend here. The example above will work as expected once we give the ch rule a correct scanner type:

    rule<lexeme_scanner<phrase_scanner_t>::type> ch 
        = alnum_p | '.' | '-' | '_';

Note: make sure to add "typename" before lexeme_scanner when this is used inside a template class or function.

The same thing happens when rules are used inside the as_lower_d directive. In such cases, you can use the as_lower_scanner. See the lexeme_scanner and as_lower_scanner.

See the techniques section for an example of a grammar using a multiple scanner enabled rule, lexeme_scanner and as_lower_scanner.

Kleene Star infinite loop

Question: Why Does This Loop Forever?

    rule<> optional = !(str_p("optional"));
    rule<> list_of_optional = *optional;

The problem with this is that the kleene star will continue looping until it gets a no-match from it's enclosed parser. Because the optional rule is optional, it will always return a match. Even if the input doesn't match "optional" it will return a zero length match. list_of_optional will keep calling optional forever since optional will never return a no-match. So in general, any rule that can be "nullable" (meaning it can return a zero length match) must not be put inside a kleene star.

Boost CVS and Spirit CVS

Question: There is Boost CVS and Spirit CVS. Which is used for further development of Spirit?

Generally, development takes place in Spirit's CVS. However, from time to time a new version of Spirit will be integrated in Boost. When this happens development takes place in the Boost CVS. There will be announcements on the Spirit mailing lists whenever the status of the Spirit CVS changes.

How to reduce compilation times with complex Spirit grammars

Question: Are there any techniques to minimize compile times using spirit? For simple parsers compile time doesn't seem to be a big issue, but recently I created a parser with about 78 rules and it took about 2 hours to compile. I would like to break the grammar up into smaller chunks, but it is not as easy as I thought it would be because rules in two grammar capsules are defined in terms of each other. Any thoughts?

The only way to reduce compile times is

The first task is merely logistical, the second is rather a technical one.

A good example of solving the first task is given in the Spirit cpp_lexer example written by JCAB.

The cross referencing problems may be solved by some kind of forward declaration, or, if this does not work, by introducing some dummy template argument to the non-templated grammars. Thus allows the instantiation time to be defered until the compiler has seen all the defintions:

    template <typename T = int>

template <typename T = int>
struct grammar1 : public grammar<grammar1>
{ // refers to grammar2<> }; template <typename T> struct grammar2 : public grammar<grammar2> { // refers to grammar1<> }; //... grammar1<> g; // both grammars instantiated here

The second task is slightly more complex. You must ensure that in the first compilation unit the compiler sees only some function/template declaration and in the second compilation unit the function/template definition. Still no problem, if no templates are involved. If templates are involved, you need to manually (explicitly) instantiate these templates with the correct template parameters inside a separate compilation unit. This way the compilation time is split between several compilation units, reducing the overall required time drastically too.

For a sample, showing how to achieve this, you may want to look at the Wave preprocessor library, where this technique is used extensively. (this should be available for download from Spirit's site as soon as you read this).

Closure frame assertion

Question: When I run the parser I get an assertion "frame.get() != 0 in file closures.hpp". What am I doing wrong?

Basically, the assertion fires when you are accessing a closure variable that is not constructed yet. Here's an example. We have three rules a, b and c. Consider that the rule a has a closure member m. Now:

    a = b;
    b = int_p[a.m = 123];
    c = b;

When the rule a is invoked, its frame is set, along with its member m. So, when b is called from a, the semantic action [a.m = 123]will store 123 into a's closure member m. On the other hand, when c is invoked, and c attempts to call b, no frame for a is set. Thus, when b is called from c, the semantic action [a.m = 123]will fire the "frame.get() != 0 in file closures.hpp" assertion.

Greedy RD

Question: I'm wondering why the this won't work when parsed:

    a = +anychar_p;
    b = '(' >> a >> ')';

Try this:

    a = +(anychar_p - ')');
    b = '(' >> a >> ')';

David Held writes: That's because it's like the langoliers--it eats everything up. You usually want to say what it shouldn't eat up by subtracting the terminating character from the parser. The moral being: Using *anychar_p or +anychar_p all by itself is usually a Bad Thing™.

In other words: Recursive Descent is inherently greedy (however, see Exhaustive backtracking and greedy RD).

Referencing a rule at construction time

Question: The code below terminates with a segmentation fault, but I'm (obviously) confused about what I'm doing wrong.

    rule<ScannerT, clos::context_t> id = int_p[id.i = arg1];

You have a rule id being constructed. Before it is constructed, you reference id.i in the RHS of the constructor. It's a chicken and egg thing. The closure member id.i is not yet constructed at that point. Using assignment will solve the problem. Try this instead:

    rule<ScannerT, clos::context_t> id;
    id = int_p[id.i = arg1];

Storing Rules

Question: Why can't I store rules in STL containers for later use and why can't I pass and return rules to and from functions by value?

EBNF is primarily declarative. Like in functional programming, It's a static recipe and there's no notion of do this then that. However, in Spirit, we managed to coax imperative C++ to take in declarative EBNF. Hah! Fun!... We did that by masquerading the C++ assignment operator to mimic EBNF's ::=, among other things (e.g. >>, |, & etc.). We used the rule class to let us do that by giving its assignment operator (and copy constructor) a different meaning and semantics. Doing so made the rule unlike any other C++ object. You can't copy it. You can't assign it. You can't place it in a container (vector, stack, etc).Heck, you can't even return it from a function *by value*.

The rule is a weird object, unlike any other C++ object. It does not have the proper copy and assignment semantics and cannot be stored and passed around by value.

However nice declarative EBNF is, the dynamic nature of C++ can be an advantage. We've seen this in action here and there. There are indeed some interesting applications of dynamic parsers using Spirit. Yet, we haven't fully utilized the power of dynamic parsing, unless(!), we have a rule that's not so alien to C++ (i.e. behaves as a good C++ object). With such a beast, we can write parsers that's defined at run time, as opposed to at compile time.

Now that I started focusing on rules (hey, check out the hunky new rule features), it might be a good time to implement the rule-holder. It is basically just a rule, but with C++ object semantics. Yet it's not as simple. Without true garbage collection, the implementation will be a bit tricky. We can't simply use reference counting because a rule-holder (hey, anyone here has a better name?) *is-a* rule, and rules are typically recursive and thus cyclic. The problem is which will own which.

Ok... this will do for now. You'll definitely see more of the rule-holder in the coming days.

Parsing Ints and Reals

Question: I was trying to parse an int or float value with the longest_d directive and put some actors on the alternatives to visualize the results. When I parse "123.456", the ouput reports:

  1. (int) has been matched: full match = false
  2. (double) has been matched: full match = true

That is not what I expected. What am I missing?

Actually, the problem is that both semantic actions of the int and real branch will be triggered because both branches will be tried. This doesn't buy us much. What actually wins in the end is what you expected. But there's no easy way to know which one wins. The problem stems from the ambiguity.

Case1: Consider this input: "2". Is it an int or a real? They are both (strictly following the grammar of a real).

Case2 : Now how about "1.0"? Is it an int or a real? They are both, albeit the int part gets a partial match: "1". That is why you are getting a (partial) match for your int rule (full match = false).

Instead of using the longest_d to parse ints and reals, what I suggest is to remove the ambiguity and use the plain short-circuiting alternatives. The first step is to use strict_real_p to make the first case unambiguous. Unlike real_p, strict_real_p requires a dot to be present for a number to be considered a successful match. Your grammar can be written unambiguously as:

    strict_real_p | int_p

Note that because ambiguity is resolved, attaching actions to both branches is safe. Only one will be triggered:

    strict_real_p[R] | int_p[I]

"1.0" ---> triggers R
"2" ---> triggers I

Again, as a rule of thumb, it is always best to resolve as much ambiguity as possible. The best grammars are those which involve no backtracking at all: an LL(1) grammar. Backtracking and semantic actions do not mix well.