Serialization

Special Considerations

Pointers to Objects of Derived Classes Registration Instantiation Selective Tracking Runtime Casting
Object Tracking
Class Information
Archive Portability Numerics Traits
Binary Archives
XML Archives
Archive Exceptions
Exception Safety

Pointers to Objects of Derived Classes

Registration

Consider the following:


class base {
    ...
};
class derived_one : public base {
    ...
};
class derived_two : public base {
    ...
};
main(){
    ...
    base *b;
    ar & b; 
}

When loading b what kind of object should be created? An object of class derived_one, derived_two, or maybe base?

If this situation is not addressed by one of the methods described below, an unregistered_class exception will be thrown when serialization is invoked.

Many times this situation is resolved automatically by the serialization library.

The system "registers" each class in an archive the first time an object of that class it is serialized and assigns a sequential number to it. Next time an object of that class is serialized in that same archive, this number is written in the archive. So every class is identified uniquely within the archive. When the archive is read back in, each new sequence number is re-associated with the class being read. Note that this implies that "registration" has to occur during both save and load so that the class-integer table built on load is identical to the class-integer table built on save. In fact, the key to whole serialization system is that things are always saved and loaded in the same sequence. This includes "registration"

In many situations the problem never comes up. Consider:


main(){
    derived_one d1;
    derived_two d2:
    ...
    ar >> d1;
    ar >> d2;
    // A side effect of serialization of objects d1 and d2 is that
    // the classes derived_one and derived_two become known to the archive.
    // So subsequent serialization of those classes by base pointer works
    // without any special considerations.
    base *b;
    ar & b; 
}

Here, the problem doesn't present itself. When b is read it is preceded by a unique (to the archive) class identifier which has previously been related to class derived_one or derived_two.

If a derived class hasn't been automatically "registered" as described above, we have the option of registering it explicitly. All archives are derived from a base class which implements the following template:


template<class T>
register_type();

So our problem could just as well be addressed by writing:


main(){
    ...
    ar.template register_type<derived_one>();
    ar.template register_type<derived_two>();
    base *b;
    ar & b; 
}

Note that if the serialization function is split between save and load, both functions must include the registration. This is required to keep the save and corresponding load in syncronization.

This will work but may be inconvenient. We don't always know which derived classes we are going to serialize when we write the code to serialize through a base class pointer. Every time a new derived class is written we have to go back to all the places where the base class is serialized and update the code.

So we have another method:


#include <boost/serialization/export.hpp>
...
BOOST_CLASS_EXPORT_GUID(derived_one, "derived_one")
BOOST_CLASS_EXPORT_GUID(derived_two, "derived_two")

main(){
    ...
    base *b;
    ar & b; 
}

The macro BOOST_CLASS_EXPORT_GUID associates a string literal with a class. In the above example we've used a string rendering of the class name. If a object of such an "exported" class is serialized through a pointer and is otherwise unregistered, the "export" string is included in the archive. When the archive is later read, the string literal is used to find the class which should be created by the serialization library. This permits each class to be in a separate header file along with its string identifier. There is no need to maintain a separate "pre-registration" of derived classes that might be serialized. This method of registration is referred to as "key export". More information on this topic is found in the section Class Traits - Export Key.

Instantiation

Registration by means of any of the above methods fulfill another role whose importance might not be obvious. This system relies on templated functions of the form template<class Archive, class T>. This means that serialization code must be instantiated for each combination of archive and data type that is serialized in the program.

Polymorphic pointers of derived classes may never be referred to explictly by the program so normally code to serialize such classes would never be instantiated. So in addition to including export key strings in an archive, BOOST_CLASS_EXPORT_GUID explicitly instantiates the class serialization code for all archive classes used by the program.

In order to do this, export.hpp includes meta programming code to build a mpl::list of all the file types used by the module by checking for definition of the header inclusion guards. Using this list, BOOST_CLASS_EXPORT_GUID will explicitly instantiate serialization code for all exported classes. For this implementaton to function, the header file export.hpp has to come after all the archive header files. This is enforced by code at the end of the header file: basic_archive.hpp which will trip a STATIC_ASSERT if this requirement is violated.

Selective Tracking

Whether or not an object is tracked is determined by its object tracking trait. The default setting for user defined types is track_selectively. That is, track objects if and only if they are serialized through pointers anywhere in the program. Any objects that are "registered" by any of the above means are presumed to be serialized through pointers somewhere in the program and therefore would be tracked. In certain situations this could lead to an inefficiency. Suppose we have a class module used by multiple programs. Because some programs serializes polymorphic pointers to objects of this class, we export a class identifier by specifying BOOST_CLASS_EXPORT in the class header. When this module is included by another program, objects of this class will always be tracked even though it may not be necessary. This situation could be addressed by using track_never in those programs.

It could also occur that even though a program serializes through a pointer, we are more concerned with efficiency than avoiding the the possibility of creating duplicate objects. It could be that we happen to know that there will be no duplicates. It could also be that the creation of a few duplicates is benign and not worth avoiding given the runtime cost of tracking duplicates. Again, track_never can be used.

Runtime Casting

In order to properly translate between base and derived pointers at runtime, the system requires each base/derived pair be found in a table. A side effect of serializing a base object with boost::serialization::base_object<Base>(Derived &) is to ensure that the base/derived pair is added to the table before the main function is entered. This is very convenient and results in a clean syntax. The only problem is that it can occur where a derived class serialized through a pointer has no need to invoke the serialization of its base class. In such a case, there are two choices. The obvious one is to invoke the base class serialization with base_object and specify an empty function for the base class serialization. The alternative is to "register" the Base/Derived relationship explicitly by invoking the template void_cast_register<Base, Derived>();. Note that this usage of the term "register" is not related to its usage in the previous section. Here is an example of how this is done:


#include <sstream>
#include <boost/serialization/serialization.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <boost/serialization/export.hpp>

class base {
    friend class boost::serialization::access;
    //...
    // only required when using method 1 below
    // no real serialization required - specify a vestigial one
    template<class Archive>
    void serialize(Archive & ar, const unsigned int file_version){}
};

class derived : public base {
    friend class boost::serialization::access;
    template<class Archive>
    void serialize(Archive & ar, const unsigned int file_version){
        // method 1 : invoke base class serialization
        boost::serialization::base_object<base>(*this);
        // method 2 : explicitly register base/derived relationship
        boost::serialization::void_cast_register<base, derived>();
    }
};

BOOST_CLASS_EXPORT_GUID(derived, "derived")

main(){
    //...
    std::stringstream ss;
    boost::archive::text_iarchive ar(ss);
    base *b;
    ar >> b; 
}

Object Tracking

Depending on how the class is used and other factors, serialized objects may be tracked by memory address. This prevents the same object from being written to or read from an archive multiple times. This could cause problems in progams where the copies of different objects are serialized from the same address.


template<class Archive>
void save(boost::basic_oarchive  & ar, const unsigned int version) const
{
    for(int i = 0; i < 10; ++i){
        A x = a[i];
        ar << x;
    }
}

In this case, the data to be saved exists on the stack. Each iteration of the loop updates the value on the stack. So although the data changes each iteration, the address of the data doesn't. If a[i] is an array of objects being tracked by memory address, the library will skip storing objects after the first as it will be assumed that objects at the same address are really the same object.

To help detect such cases, output archive operators expect to be passed const reference arguments.

Given this, the above code will invoke a compile time assertion. The obvious fix in this example is to use


template<class Archive>
void save(boost::basic_oarchive & ar, const unsigned int version) const
{
    for(int i = 0; i < 10; ++i){
        ar << a[i];
    }
}

which will compile and run without problem.

The usage of const by the output archive operators will ensure that the process of serialization doesn't change the state of the objects being serialized. An attempt to do this would constitute augmentation of the concept of saving of state with some sort of non-obvious side effect. This would almost surely be a mistake and a likely source of very subtle bugs. As described above, addresses of objects serialized as pointers are stored in memory to prevent saving/loading of duplicate objects. These stored addresses can also be used to delete objects created during a loading process that has been interrupted by throwing of an exception. By default, code to implement this tracking is instantiated if and only if an object of the class is serialized through a pointer. If it is known a priori that no pointer values are duplicated, overhead associated with object tracking can be eliminated by setting the object tracking class serialization trait appropriately.

By definition, data types designated primitive by Implementation Level class serialization trait are never tracked. If it is desired to track a shared primitive object through a pointer (e.g. a long used as a reference count), It should be wrapped in a class/struct so that it is an identifiable type. The alternative of changing the implementation level of a long would affect all longs serialized in the whole program - probably not what one would intend.

It is possible that we may want to track addresses even though the object is never serialized through a pointer. For example, a virtual base class need be saved/loaded only once. By setting this serialization trait to track_always, we can suppress redundant save/load operations.


BOOST_CLASS_TRACKING(my_virtual_base_class, boost::serialization::track_always)

Class Information

By default, for each class serialized, class information is written to the archive. This information includes version number, implementation level and tracking behavior. This is necessary so that the archive can be correctly deserialized even if a subsequent version of the program changes some of the current trait values for a class. The space overhead for this data is minimal. There is a little bit of runtime overhead since each class has to be checked to see if it has already had its class information included in the archive. In some cases, even this might be considered too much. This extra overhead can be eliminated by setting the implementation level class trait to: boost::serialization::object_serializable.

Turning off tracking and class information serialization will result in pure template inline code that in principle could be optimised down to a simple stream write/read. Elimination of all serialization overhead in this manner comes at a cost. Once archives are released to users, the class serialization traits cannot be changed without invalidating the old archives. Including the class information in the archive assures us that they will be readable in the future even if the class definition is revised. A light weight structure such as display pixel might be declared in a header like this:


#include <boost/serialization/serialization.hpp>
#include <boost/serialization/level.hpp>
#include <boost/serialization/tracking.hpp>

// a pixel is a light weight struct which is used in great numbers.
struct pixel
{
    unsigned char red, green, blue;
    template<class Archive>
    void serialize(Archive & ar, const unsigned int /* version */){
        ar << red << green << blue;
    }
};

// elminate serialization overhead at the cost of
// never being able to increase the version.
BOOST_CLASS_IMPLEMENTATION(pixel, boost::serialization::object_serializable);

// eliminate object tracking (even if serialized through a pointer)
// at the risk of a programming error creating duplicate objects.
BOOST_CLASS_TRACKING(pixel, boost::serialization::track_never)

Archive Portability

Several archive classes create their data in the form of text or portable a binary format. It should be possible to save such an of such a class on one platform and load it on another. This is subject to a couple of conditions.

Numerics

The architecture of the machine reading the archive must be able hold the data saved. For example, the gcc compiler reserves 4 bytes to store a variable of type wchar_t while other compilers reserve only 2 bytes. So its possible that a value could be written that couldn't be represented by the loading program. This is a fairly obvious situation and easily handled by using the numeric types in <boost/cstdint.hpp>

Traits

Another potential problem is illustrated by the following example:


template<class T>
struct my_wrapper {
    template<class Archive>
    Archive & serialize ...
};

...

class my_class {
    wchar_t a;
    short unsigned b;
    template<<class Archive>
    Archive & serialize(Archive & ar, unsigned int version){
        ar & my_wrapper(a);
        ar & my_wrapper(b);
    }
};

If my_wrapper uses default serialization traits there could be a problem. With the default traits, each time a new type is added to the archive, bookkeeping information is added. So in this example, the archive would include such bookkeeping information for my_wrapper<wchar_t> and for my_wrapper<short_unsigned>. Or would it? What about compilers that treat wchar_t as a synonym for unsigned short? In this case there is only one distinct type - not two. If archives are passed between programs with compilers that differ in their treatment of wchar_t the load operation will fail in a catastrophic way.

One remedy for this is to assign serialization traits to the template my_template such that class information for instantiations of this template is never serialized. This process is described above and has been used for Name-Value Pairs. Wrappers would typically be assigned such traits.

Another way to avoid this problem is to assign serialization traits to all specializations of the template my_wrapper for all primitive types so that class information is never saved. This is what has been done for our implementation of serializations for STL collections.

Binary Archives

Standard stream i/o on some systems will expand linefeed characters to carriage-return/linefeed on output. This creates a problem for binary archives. The easiest way to handle this is to open streams for binary archives in "binary mode" by using the flag ios::binary. If this is not done, the archive generated will be unreadable.

Unfortunately, no way has been found to detect this error before loading the archive. Debug builds will assert when this is detected so that may be helpful in catching this error.

XML Archives

XML archives present a somewhat special case. XML format has a nested structure that maps well to the "recursive class member visitor" pattern used by the serialization system. However, XML differs from other formats in that it requires a name for each data member. Our goal is to add this information to the class serialization specification while still permiting the the serialization code to be used with any archive. This is achived by requiring that all data serialized to an XML archive be serialized as a name-value pair. The first member is the name to be used as the XML tag for the data item while the second is a reference to the data item itself. Any attempt to serialize data not wrapped in a in a name-value pair will be trapped at compile time. The system is implemented in such a way that for other archive classes, just the value portion of the data is serialized. The name portion is discarded during compilation. So by always using name-value pairs, it will be guarenteed that all data can be serialized to all archive classes with maximum efficiency.