...one of the most highly
regarded and expertly designed C++ library projects in the
world.
— Herb Sutter and Andrei
Alexandrescu, C++
Coding Standards
Copyright © 2001-2008 Joel de Guzman, Hartmut Kaiser
Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
Table of Contents
“Examples of designs that meet most of the criteria for "goodness" (easy to understand, flexible, efficient) are a recursive- descent parser, which is traditional procedural code. Another example is the STL, which is a generic library of containers and algorithms depending crucially on both traditional procedural code and on parametric polymorphism.” --Bjarne Stroustrup
In the Mid 80s, Joel wrote his first calculator in Pascal. It has been an unforgettable coding experience. He was amazed how a mutually recursive set of functions can model a grammar specification. In time, the skills he acquired from that academic experience became very practical. Periodically Joel was tasked to do some parsing. For instance, whenever he needs to perform any form of I/O, even in binary, he tries to approach the task somewhat formally by writing a grammar using Pascal- like syntax diagrams and then write a corresponding recursive-descent parser. This worked very well.
The arrival of the Internet and the World Wide Web magnified this thousand-fold. At one point Joel had to write an HTML parser for a Web browser project. He got a recursive-descent HTML parser working based on the W3C formal specifications easily. He was certainly glad that HTML had a formal grammar specification. Because of the influence of the Internet, Joel then had to do more parsing. RFC specifications were everywhere. SGML, HTML, XML, even email addresses and those seemingly trivial URLs were all formally specified using small EBNF- style grammar specifications. This made him wish for a tool similar to big- time parser generators such as YACC and ANTLR, where a parser is built automatically from a grammar specification. Yet, he wants it to be extremely small; small enough to fit in my pocket, yet scalable.
It must be able to practically parse simple grammars such as email addresses to moderately complex grammars such as XML and perhaps some small to medium-sized scripting languages. Scalability is a prime goal. You should be able to use it for small tasks such as parsing command lines without incurring a heavy payload, as you do when you are using YACC or PCCTS. Even now that it has evolved and matured to become a multi-module library, true to its original intent, Spirit can still be used for extreme micro-parsing tasks. You only pay for features that you need. The power of Spirit comes from its modularity and extensibility. Instead of giving you a sledgehammer, it gives you the right ingredients to create a sledgehammer easily.
The result was Spirit. Spirit was a personal project that was conceived when Joel was doing R&D in Japan. Inspired by the GoF's composite and interpreter patterns, he realized that he can model a recursive-descent parser with hierarchical-object composition of primitives (terminals) and composites (productions). The original version was implemented with run-time polymorphic classes. A parser is generated at run time by feeding in production rule strings such as:
"prod ::= {'A' | 'B'} 'C';"
A compile function compiled the parser, dynamically creating a hierarchy of objects and linking semantic actions on the fly. A very early text can be found here: pre-Spirit.
Version 1.0 to 1.8 was a complete rewrite of the original Spirit parser using expression templates and static polymorphism, inspired by the works of Todd Veldhuizen (Expression Templates, C++ Report, June 1995). Initially, the static-Spirit version was meant only to replace the core of the original dynamic-Spirit. Dynamic-spirit needed a parser to implement itself anyway. The original employed a hand-coded recursive-descent parser to parse the input grammar specification strings. Incidentially it was the time, when Hartmut joined the Spirit development.
After its initial "open-source" debut in May 2001, static-Spirit became a success. At around November 2001, the Spirit website had an activity percentile of 98%, making it the number one parser tool at Source Forge at the time. Not bad for such a niche project such as a parser library. The "static" portion of Spirit was forgotten and static-Spirit simply became Spirit. The library soon evolved to acquire more dynamic features.
Spirit was formally accepted into Boost in October 2002. Boost is a peer-reviewed, open collaborative development effort that is a collection of free Open Source C++ libraries covering a wide range of domains. The Boost Libraries have become widely known as an industry standard for design and implementation quality, robustness, and reusability.
Over the years, especially after Spirit was accepted into Boost, Spirit has served its purpose quite admirably. The focus of what we'll now call Classic-Spirit (versions prior to 2.0) was on transduction parsing where the input string is merely translated to an output string. A lot of parsers are of the transduction type. When the time came to add attributes to the parser library, it was done rather in an ad-hoc manner, with the goal being 100% backward compatible with classic Spirit. Some parsers have attributes, some don't.
Spirit V2 is another major rewrite. Spirit V2 grammars are fully attributed (see Attribute Grammar). All parser components have attributes. To do this efficiently and ellegantly, we had to use a couple of infrastructure libraries. Some of which haven't been written yet at the time, some were quite new when Spirit debuted, and some needed work. MPL is an important infrastructure library, yet is not sufficient to implement Spirit V2. Another library had to be written: Fusion. Fusion sits between MPL and STL --between compile time and runtime -- mapping types to values. Fusion is a direct descendant of both MPL and Boost.Tuples (Fusion is now a full fledged Boost library). Phoenix also had to be beefed up to support Spirit V2. The result is Phoenix2. Last but not least, Spirit V2 uses an Expression Templates library called -Boost.Proto-.
Just before the development of Spirit V2 began, Hartmut came across the StringTemplate library which is a part of the ANTLR parser framework. It is a Java template engine (with ports for C# and Python) for generating source code, web pages, emails, or any other formatted text output. With it, he got the the idea of using a formal notation (a grammar) to describe the expected structure of an input character sequence. The same grammar may be used to formalize the structure of a corresponding output character sequence. This is possible because parsing, most of the time, is implemented by comparing the input with the patterns defined by the grammar. If we use the same patterns to format a matching output, the generated sequence will follow the rules of the grammar as well.
This insight lead to the implementation of a grammar driven output generation library compatibile with the Spirit parser library. As it turned out, parsing and generation are tightly connected and have very similar concepts. The duality of these two sides of the same medal is ubiquitous, which allowed us to build the parser library Spirit.Qi and the generator library Spirit.Karma using the same component infastructure.
The idea of creating a lexer library well integrated with the Spirit parsers is not new. This has been discussed almost for the whole time of the existence of Classic-Spirit (pre V2) now. Several attempts to integrate existing lexer libraries and frameworks with Spirit have been made and served as a proof of concept and usability (for example see Wave: The Boost C/C++ Preprocessor Library, and SLex: a fully dynamic C++ lexer implemented with Spirit). Based on these experiences we added Spirit.Lex: a fully integrated lexer library to the mix, allowing to take advantage of the power of regular expressions for token matching, removing pressure from the parser components, simplifying parser grammars. Again, Spirit's modular structure allowed us to reuse the same underlying component library as for the parser and generator libraries.
Each major section (there are two: Qi and Karma, and Lex) is roughly divided into 3 parts:
Some icons are used to mark certain topics indicative of their relevance. These icons precede some text to indicate:
Table 1. Icons
Icon |
Name |
Meaning |
---|---|---|
|
Note |
Generally useful information (an aside that doesn't fit in the flow of the text) |
|
Tip |
Suggestion on how to do something (especially something that not be obvious) |
|
Important |
Important note on something to take particular notice of |
|
Caution |
Take special care with this - it may not be what you expect and may cause bad results |
|
Danger |
This is likely to cause serious trouble if ignored |
This documentation is automatically generated by Boost QuickBook documentation tool. QuickBook can be found in the Boost Tools.
Please direct all questions to Spirit's mailing list. You can subscribe to the Spirit General List. The mailing list has a searchable archive. A search link to this archive is provided in Spirit's home page. You may also read and post messages to the mailing list through Spirit General NNTP news portal (thanks to Gmane). The news group mirrors the mailing list. Here is a link to the archives: http://news.gmane.org/gmane.comp.parsers.spirit.general.
Last revised: December 23, 2008 at 09:37:55 GMT |