Boost.Locale
Glossary
  • Basic Multilingual Plane (BMP) -- the part of the Universal Character Set with code points in the range U+0000 to U+FFFF. The most commonly used UCS characters lie in this plane, including Western, Cyrillic, Hebrew, Thai, Arabic and the most common CJK characters. However, many characters lie outside the BMP, and they are required for correct support of East Asian languages.
  • Code Point -- a unique number that represents a "character" in the Universal Character Set. Code points lie in the range 0 to 0x10FFFF, and are usually displayed as U+XXXX or U+XXXXXX, where X represents a hexadecimal digit.
  • Collation -- a sorting order for text, usually alphabetical. It can differ between languages and countries, even for the same characters.
  • Encoding -- a representation of a character set. Some encodings, like UTF-8, can represent the full UCS range; others can represent only a subset of it. For example, ISO-8859-8 represents only a small subset of about 250 UCS characters.
    Non-Unicode encodings are still very popular. For example, the Latin-1 (or ISO-8859-1) encoding covers most of the characters for Western European languages and significantly simplifies text processing for applications designed to handle only such languages.
    For Boost.Locale you should provide an eight-bit (std::string) encoding as part of the locale name, like en_US.UTF-8 or he_IL.cp1255. UTF-8 is recommended.
  • Facet -- or std::locale::facet -- the base class from which every object that describes a specific aspect of a locale is derived. Facets can be added to a locale to provide additional culture information.
  • Formatting -- the representation of various values according to locale preferences. For example, the number 1234.5 (C representation) should be displayed as 1,234.5 in the US locale and 1.234,5 in the German locale. The date November 1st, 2005 would be represented as 11/01/2005 in the United States, and 01.11.2005 in Russia. This is an important part of localization.
    For example: does "You have to bring 134,230 kg of rice on 04/01/2010" mean "134 tons of rice on the first of April" or "134 kg 230 g of rice on January 4th"? The two readings are very different.
  • Gettext -- the GNU localization library used for message formatting. Today it is the de facto standard localization library in the Open Source world. Boost.Locale message formatting is entirely built on Gettext message catalogs.
  • Locale -- a set of parameters that define specific preferences for users in different cultures. It is generally defined by language, country, variants, and encoding, and provides information such as collation order, date-time formatting, message formatting, number formatting and many others. In C++, locale information is represented by the std::locale class.
  • Message Formatting -- the representation of user interface strings in the user's language. The process of translation of UI strings is generally done using some dictionary provided by the program's translator.
  • Message Domain -- in gettext terms, the keyword that represents a message catalog. This is usually an application name. When gettext and Boost.Locale search for a specific message catalog, they search in the specified path for a file named after the domain.
  • Normalization -- Unicode normalization is the process of converting strings to a standard form, suitable for text processing and comparison. For example, the character "ü" can be represented by a single code point or by a combination of the character "u" and the diaeresis "¨". Normalization is an important part of Unicode text processing.
    Normalization is not locale-dependent, but because it is an important part of Unicode processing, it is included in the Boost.Locale library.
  • UCS-2 -- a fixed-width Unicode encoding, capable of representing only code points in the Basic Multilingual Plane (BMP). It is a legacy encoding and is not recommended for use.
  • Unicode -- the industry standard that defines the representation and manipulation of text suitable for most languages and countries. It should not be confused with the Universal Character Set: Unicode is a much larger standard that also defines algorithms such as bidirectional display order and Arabic shaping.
  • Universal Character Set (UCS) -- an international standard that defines a set of characters for many scripts and their code points.
  • UTF-8 -- a variable-width Unicode transformation format. Each UCS code point is represented as a sequence of between 1 and 4 octets that can be easily distinguished. It includes ASCII as a subset. It is the most popular Unicode encoding for web applications, data transfer and storage, and is the de facto standard encoding for most POSIX operating systems.
  • UTF-16 -- a variable-width Unicode transformation format. Each UCS code point is represented as a sequence of one or two 16-bit words. It is a very popular encoding on platforms such as the Win32 API, Java and C#. However, it is frequently confused with the UCS-2 fixed-width encoding, which can only represent characters in the Basic Multilingual Plane (BMP).
    This encoding is used for std::wstring on the Win32 platform, where sizeof(wchar_t)==2.
  • UTF-32/UCS-4 -- a fixed-width Unicode transformation format, where each code point is represented as a single 32-bit word. It has the advantage of simple code point representation, but is wasteful in terms of memory usage. It is used as the std::wstring encoding on most POSIX platforms, where sizeof(wchar_t)==4.
  • Case Folding -- the process of converting text to a case-independent representation. For example, the case folding of the word "Grüßen" is "grüssen", where the letter "ß" is represented in a case-independent way as "ss".
  • Title Case -- a text conversion in which the first letter of each word is capitalized. For example, "hello world" is converted to "Hello World".
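The relationship between code points, the BMP, UTF-8 and UTF-16 described in the entries above can be illustrated with a small sketch. It is not part of Boost.Locale and does not use it; the function names (is_scalar_value, in_bmp, to_utf8, to_utf16) are hypothetical helpers written only for this illustration. It shows that a code point takes 1 to 4 octets in UTF-8, and one 16-bit word in UTF-16 when it lies in the BMP or two words (a surrogate pair) when it does not.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// True if cp can be encoded: within 0..0x10FFFF and not a UTF-16 surrogate.
bool is_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// True if cp lies in the Basic Multilingual Plane (U+0000 to U+FFFF).
bool in_bmp(std::uint32_t cp) { return cp <= 0xFFFF; }

// Encode one code point as 1 to 4 UTF-8 octets.
std::string to_utf8(std::uint32_t cp) {
    if (!is_scalar_value(cp)) throw std::invalid_argument("invalid code point");
    std::string out;
    if (cp < 0x80) {                       // ASCII: one octet
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // two octets
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // rest of the BMP: three octets
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // outside the BMP: four octets
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as one or two UTF-16 words.
std::u16string to_utf16(std::uint32_t cp) {
    if (!is_scalar_value(cp)) throw std::invalid_argument("invalid code point");
    std::u16string out;
    if (in_bmp(cp)) {                      // BMP: a single 16-bit word
        out += static_cast<char16_t>(cp);
    } else {                               // outside the BMP: surrogate pair
        std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));
    }
    return out;
}
```

For instance, to_utf8(0xFC) (the single-code-point form of "ü") yields two octets, while to_utf16(0x1F600) yields two 16-bit words because U+1F600 lies outside the BMP; this is the case that a UCS-2 consumer cannot represent.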
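The Facet and Formatting entries can be made concrete with a minimal sketch that uses only the standard library, not Boost.Locale. The class simple_punct and the function format_number are hypothetical names introduced for this illustration. The sketch installs a custom std::numpunct facet into a std::locale and uses it to produce the two number conventions quoted in the Formatting entry, 1,234.5 and 1.234,5.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// A facet that overrides the punctuation used when formatting numbers.
class simple_punct : public std::numpunct<char> {
public:
    simple_punct(char thousands, char decimal)
        : thousands_(thousands), decimal_(decimal) {}

protected:
    char do_thousands_sep() const override { return thousands_; }
    char do_decimal_point() const override { return decimal_; }
    // Group digits of the integer part in threes.
    std::string do_grouping() const override { return "\3"; }

private:
    char thousands_;
    char decimal_;
};

// Format value with one fractional digit using the given separators.
std::string format_number(double value, char thousands, char decimal) {
    std::ostringstream out;
    // Build a locale from the classic "C" locale plus our facet.
    // The locale takes ownership of the facet pointer.
    out.imbue(std::locale(std::locale::classic(),
                          new simple_punct(thousands, decimal)));
    out << std::fixed << std::setprecision(1) << value;
    return out.str();
}
```

Calling format_number(1234.5, ',', '.') gives the first convention and format_number(1234.5, '.', ',') gives the second. A full localization library would select such punctuation from the locale name rather than from explicit arguments; passing the separators by hand here only keeps the example self-contained.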