Why do we need a localization library, when standard C++ facets (should) provide most of the required functionality:
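Collation is supported by std::collate, which has nice integration with std::locale.
There are std::num_put, std::num_get, std::money_put, std::money_get, std::time_put, and std::time_get for numbers, time, and currency formatting and parsing.
There is a std::messages class that supports localized message formatting.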
So why do we need such a library if we have all the functionality within the standard library?
Almost every(!) facet has design flaws:
std::collate supports only one level of collation, not allowing you to choose whether case- or accent-sensitive comparisons should be performed.
std::ctype, which is responsible for case conversion, assumes that all conversions can be done on a per-character basis. This is probably correct for many languages but it isn't correct in general.
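The toupper function works on a single-character basis, so it cannot map one character to several (for example, the German "ß" to "SS").
In UTF-8 and UTF-16, a single code point may occupy several char's or two wchar_t's on the Windows platform. This makes std::ctype totally useless with these encodings.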
std::numpunct and std::moneypunct do not specify the code points for digit representation at all, so they cannot format numbers with the digits used under Arabic locales. For example, the number "103" is expected to be displayed as "١٠٣" in such a locale.
std::numpunct and std::moneypunct assume that the thousands separator is a single character. This is untrue for the UTF-8 encoding, where only code points in the range U+0000 to U+007F are represented as a single char. As a result, localized numbers can't be represented correctly under locales that use a non-ASCII space character (for example, the Unicode "NO-BREAK SPACE", U+00A0) for the thousands separator, such as Russian.
std::time_put and std::time_get have several flaws:
They use std::tm for time representation, ignoring the fact that in many countries dates may be displayed using different calendars.
The standard std::tm doesn't include a timezone field at all.
std::time_get is not symmetric with std::time_put, so you cannot parse dates and times created with std::time_put. (This issue is addressed in C++0x and in some STL implementations, such as the Apache standard C++ library.)
std::messages does not provide support for plural forms, making it impossible to correctly localize such simple strings as "There are X files in the directory".
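To illustrate why plural forms need more than a single translated string, here is a minimal sketch of count-dependent form selection. The function names plural_index_en, plural_index_ru, and files_message_en are hypothetical, and the rules are the commonly used gettext-style plural expressions for English and Russian, not anything taken from the standard library. Note that std::messages::get takes a catalog, a set, a message id, and a default string, but no count, so a rule like this has nowhere to plug in.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <string>

// Hypothetical helper: English has two forms, singular (0) and plural (1).
int plural_index_en(unsigned long n) {
    return n != 1 ? 1 : 0;
}

// Hypothetical helper: Russian has three forms; the choice depends on the
// last digit and the last two digits of the count.
int plural_index_ru(unsigned long n) {
    if (n % 10 == 1 && n % 100 != 11) return 0;
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20)) return 1;
    return 2;
}

// Hypothetical helper: builds the English message with the right form.
std::string files_message_en(unsigned long n) {
    const std::string count = std::to_string(n);
    return plural_index_en(n) == 0
        ? "There is " + count + " file in the directory"
        : "There are " + count + " files in the directory";
}
```

With these rules, assert(plural_index_ru(21) == 0) and assert(plural_index_ru(11) == 2) both hold, which shows that a simple "one versus many" switch is not enough for every language.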
Also, many features are not really supported by std::locale at all: timezones (as mentioned above), text boundary analysis, number spelling, and many others. So it is clear that the standard C++ locales are problematic for real-world applications.
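The per-character limitation of std::ctype described above can be seen with a short sketch. It assumes the input is UTF-8 and uses the classic "C" locale so the behavior does not depend on which locales are installed; the helper name to_upper_per_char is hypothetical.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <locale>
#include <string>

// Upper-cases a string one char at a time, which is the only mode of
// operation that std::ctype<char> offers.
std::string to_upper_per_char(const std::string& text) {
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char>>(std::locale::classic());
    std::string result = text;
    for (char& c : result) {
        c = ct.toupper(c);   // one char in, one char out
    }
    return result;
}
```

For the UTF-8 string "grüßen" (written as "gr\xC3\xBC\xC3\x9F" "en"), only the ASCII letters change: the bytes of "ü" and "ß" pass through untouched, and the result can never become "GRÜSSEN" because each input char yields exactly one output char.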
ICU is a very good localization library, but it has several serious flaws:
For example, Boost.Locale provides direct integration with iostream, allowing a more natural way of data formatting.
ICU is one of the best localization/Unicode libraries available. It consists of about half a million lines of well-tested, production-proven source code that today provides state-of-the-art localization tools.
Reimplementing even a small part of ICU's abilities is an infeasible project that would require many man-years. So the question is not whether we need to reimplement the Unicode and localization algorithms from scratch, but "Do we need a good localization library in Boost?"
Thus Boost.Locale wraps ICU with a modern C++ interface, allowing future reimplementation of parts with better alternatives, but bringing localization support to Boost today and not in the not-so-near-if-at-all future.
Yes, the entire ICU API is hidden behind opaque pointers and users have no access to it. This is done for several reasons:
There are many available localization formats. The most popular so far are OASIS XLIFF, GNU gettext po/mo files, POSIX catalogs, Qt ts/tm files, Java properties, and Windows resources. However, the last three are useful only in their specific areas, and POSIX catalogs are too simple and limited, so there are only two reasonable options:
The first one generally seems like a more correct localization solution, but it requires XML parsing for loading documents, it is a very complicated format, and even ICU requires preliminary compilation of it into ICU resource bundles.
On the other hand:
So, even though the GNU Gettext mo catalog format is not an officially approved file format:
There are several reasons:
ptime definitely could be used, but it has several problems:
time() gives a representation that is independent of time zones (usually GMT time), and only later should it be represented in a time zone that the user requests.
ptime already defines operator<< and operator>> for time formatting and parsing.
ptime formatting and parsing were not designed in a way that the user can override. The major formatting and parsing functions are not virtual. This makes it impossible to reimplement the formatting and parsing functions of ptime unless the developers of the Boost.DateTime library decide to change them.
The facets of ptime are not "correctly" designed in terms of the division between formatting information and locale information. Formatting information should be stored within std::ios_base, and information about locale-specific formatting should be stored in the facet itself.
Thus, at this point, ptime is not supported for formatting localized dates and times.
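The point about time() above can be sketched with the standard C library alone: a time value is a plain number with no time zone attached, and a zone is applied only when the value is formatted. The helpers format_utc and format_with_offset are hypothetical names, the sketch assumes the usual POSIX epoch of 1970-01-01 00:00:00 UTC, and a fixed offset in seconds is a simplification (real time zones also need daylight-saving rules).

```cpp
#include <cassert>   // used by the checks in the usage note
#include <ctime>
#include <string>

// Formats a time value as a UTC date and time.
std::string format_utc(std::time_t t) {
    const std::tm* utc = std::gmtime(&t);
    if (utc == nullptr) {
        return std::string();   // value not representable as a calendar time
    }
    char buf[32];
    const std::size_t n =
        std::strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S", utc);
    return std::string(buf, n);
}

// Applies a fixed zone offset only at formatting time; the stored value
// itself stays zone-independent.
std::string format_with_offset(std::time_t t, long offset_seconds) {
    return format_utc(t + offset_seconds);
}
```

Under these assumptions, format_utc(0) yields "1970-01-01 00:00:00", and the same stored value formatted with a three-hour offset yields "1970-01-01 03:00:00".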
There are several reasons:
There are two reasons:
std::codecvt API works on streams of any size without problems.
There are several major reasons:
This is how the std::locale class is built. Each feature is represented using a subclass of std::locale::facet that provides an abstract API for the specific operations it works on; see Introduction to C++ Standard Library localization support.
There are several reasons:
C++0x defines char16_t and char32_t as distinct types, so substituting them with something like uint32_t would not work; for example, writing a uint32_t to a stream would write a number to the stream.
Facets such as std::num_put are installed into the existing instance of std::locale; however, in many standard C++ libraries these facets are specialized for each specific character type that the standard library supports, so an attempt to create a new facet would fail as it is not specialized.
These are exactly the reasons why Boost.Locale fails with the current limited C++0x character support on GCC-4.5 (the second reason) and MSVC-2010 (the first reason).
So basically it is impossible to use non-C++ characters with the C++ locale framework.
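The first reason above, that an integer type cannot stand in for a character type, can be checked with a small sketch using only the standard library. The helper name streamed is hypothetical.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstdint>
#include <sstream>
#include <string>
#include <type_traits>

// char32_t is a distinct type, not an alias for an integer type.
static_assert(!std::is_same<char32_t, std::uint32_t>::value,
              "char32_t must be distinct from uint32_t");

// Streams a value through operator<< and returns the text produced.
template <typename T>
std::string streamed(const T& value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
```

Here streamed('A') produces "A", while streamed(std::uint32_t(65)) produces "65": the stream treats the integer as a number, not as the character with that code.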
The best and most portable solution is to use the C++ char type and the UTF-8 encoding.
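As a rough illustration of working with UTF-8 in plain char strings, the sketch below counts code points in a std::string and passes the string through a narrow stream. It assumes the input is valid UTF-8, and the helper names utf8_code_points and round_trip are hypothetical.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstddef>
#include <sstream>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& text) {
    std::size_t count = 0;
    for (unsigned char c : text) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Writes the string to a narrow stream and reads the bytes back unchanged.
std::string round_trip(const std::string& text) {
    std::ostringstream out;
    out << text;
    return out.str();
}
```

For "grüßen" encoded as UTF-8, the string holds 8 char's but utf8_code_points reports 6, and round_trip returns the same bytes it was given, so ordinary char streams can carry UTF-8 text without any new character type.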