Boost.Locale
Boost.Locale provides the to_utf, from_utf and utf_to_utf functions in the boost::locale::conv namespace. They are simple, convenient functions for converting between UTF-8/16/32 strings and strings in other encodings.
For example:
    std::string utf8_string = to_utf<char>(latin1_string,"Latin1");
    std::wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
    std::string latin1_again = from_utf(wide_string,"Latin1");
    std::string utf8_string2 = utf_to_utf<char>(wide_string);
to_utf and from_utf accept either an explicit encoding name such as "Latin1" or "ISO-8859-8", or a std::locale from which the encoding is taken. All three functions also take a policy parameter that says how to behave when a conversion cannot be performed, that is, when an illegal or unsupported character is found. By default they skip all illegal characters and do the best they can; to have them throw a conversion_error exception instead, pass the stop flag:
    std::wstring s = to_utf<wchar_t>("\xFF\xFF","UTF-8",stop); // Throws because this string is illegal in UTF-8
Boost.Locale provides stream codepage conversion facets based on the std::codecvt facet. These allow conversion between wide-character encodings and 8-bit encodings like UTF-8, ISO-8859 or Shift-JIS.
Most compilers provide such facets, but their coverage depends on the platform and on which locales are installed. For example, it may be impossible to create a he_IL.CP1255 locale even when the he_IL locale is available.
Thus Boost.Locale provides an option to generate code-page conversion facets for use with Boost.Iostreams filters or std::wfstream. For example:
    std::locale loc = generator().generate("he_IL.UTF-8");
    std::wofstream file;
    file.imbue(loc);
    file.open("hello.txt");
    file << L"שלום!" << std::endl;
This creates a file hello.txt encoded as UTF-8 with "שלום!" (shalom) in it.
You can use the std::codecvt facet directly, but this is quite tricky and requires accurate buffer and error management.
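To give an idea of what that buffer and error management involves, here is a minimal sketch that drives a std::codecvt facet by hand. It is an illustration under assumptions, not Boost.Locale code: the helper name encode_utf8 is hypothetical, and it uses the standard codecvt<char32_t,char,mbstate_t> facet (UTF-32 to UTF-8, deprecated since C++20) from the classic locale so that the result does not depend on which locales are installed.

```cpp
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 text as UTF-8 by calling codecvt::out
// repeatedly with a small fixed-size output buffer.
std::string encode_utf8(const std::u32string& in)
{
    typedef std::codecvt<char32_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t(); // initial conversion state
    std::string out;
    char buf[8]; // deliberately small so that several rounds are needed

    const char32_t* from = in.data();
    const char32_t* from_end = from + in.size();
    while (from != from_end) {
        const char32_t* from_next = from;
        char* to_next = buf;
        std::codecvt_base::result r =
            cvt.out(state, from, from_end, from_next, buf, buf + sizeof buf, to_next);
        if (r == std::codecvt_base::error)
            throw std::runtime_error("code point cannot be encoded");
        // "partial" means the output buffer filled up: flush it and go on.
        out.append(buf, to_next);
        if (from_next == from && to_next == buf)
            throw std::runtime_error("conversion made no progress");
        from = from_next;
    }
    return out;
}
```

Even in this simple one-directional case the caller has to track two cursors, distinguish a full buffer from a real error, and guard against a loop that makes no progress, which is why the stream-based approaches are usually easier to get right.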
You can use the boost::iostreams::code_converter class for stream-oriented conversions between the wide character set and the narrow locale character set.
This is a sample program that converts wide to narrow characters for an arbitrary stream:
    #include <boost/iostreams/stream.hpp>
    #include <boost/iostreams/categories.hpp>
    #include <boost/iostreams/code_converter.hpp>
    #include <boost/locale.hpp>
    #include <iostream>

    namespace io = boost::iostreams;

    // Device that consumes the converted text.
    // In our case it just writes to standard output.
    class consumer {
    public:
        typedef char char_type;
        typedef io::sink_tag category;
        std::streamsize write(const char* s, std::streamsize n)
        {
            std::cout.write(s,n);
            return n;
        }
    };

    int main()
    {
        // The device that converts wide characters to narrow ones
        typedef io::code_converter<consumer> converter_device;
        // The stream that uses this device
        typedef io::stream<converter_device> converter_stream;

        consumer cons;
        // Set up our converter to work with the he_IL.UTF-8 locale
        converter_device dev;
        boost::locale::generator gen;
        dev.imbue(gen("he_IL.UTF-8"));
        dev.open(cons);
        converter_stream stream;
        stream.open(dev);
        // Wide characters written to the stream are now passed
        // to our consumer as narrow characters in UTF-8 encoding
        stream << L"שלום" << std::flush;
    }
The Standard does not provide any information about std::mbstate_t that could be used to save intermediate code-page conversion states. It leaves the definition up to the compiler implementation, making it impossible to reimplement std::codecvt<wchar_t,char,mbstate_t> for stateful encodings. Thus, Boost.Locale's codecvt facet implementation may be used with stateless encodings like UTF-8, ISO-8859, and Shift-JIS, but not with stateful encodings like UTF-7 or SCSU.
Recommendation: Prefer the Unicode UTF-8 encoding for char-based strings and files in your application.
The codecvt implementation for single-byte encodings like ISO-8859-X and for UTF-8 is very efficient and allows fast conversion of content. Its performance may, however, be sub-optimal for double-width encodings like Shift-JIS because of the stateless restriction described above.