Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

Character sets
PrevUpHomeNext

Character set refresher

MySQL defines a character set as "a set of symbols and their respective encodings". utf8mb4, utf16 and ascii are character sets supported by MySQL.

A collation is a set of rules for comparing characters in a character set. For example, a case-insensitive collation will make strings that only differ in case compare equal. All collations are associated to a single character set. For example, utf8mb4_spanish_ci is a case-insensitive collation associated to the utf8mb4 character set. Every character set has a default collation, which will be used if a character set without a collation is specified. For example, latin1_swedish_ci is the default collation for the latin1 character set.

You can find more information about these concepts in the official MySQL docs on character sets.

The connection character set and collation

Every connection has an associated character set and collation. The connection's character set determines the encoding for character strings sent to and retrieved from the server. This includes SQL query strings, string fields and column names in metadata. The connection's collation is used for string literal comparison.

Every session you establish can have its own different character set and collation. You can specify this in two ways:

  • When calling connection::connect or connection::handshake, using handshake_params::connection_collation. You specify a numeric ID that identifies the collation to use, and your connection will use the character set associated to this collation. You can find collation IDs in the <boost/mysql/mysql_collations.hpp> and <boost/mysql/mariadb_collations.hpp> headers.

    The problem with this approach is that if you specify a collation ID that is unknown to the server (e.g. utf8mb4_0900_ai_ci for an old MySQL 5.7 server), the handshake operation will succeed but the connection will silently fall back to the server's default character set, (usually latin1, which is not Unicode).

  • At any time, issuing a SET NAMES SQL statement. For example, "SET NAMES utf8mb4" will set the current connection's character set to utf8mb4 and the connection's collation to utf8mb4's default collation. If the character set is unknown, the SET NAMES statement will fail. You can use connection::execute to issue the statement:
results result;
conn.execute("SET NAMES utf8mb4", result);
// Further operations can assume utf8mb4 as conn's charset

character_set_results and character_set_client

Both of the above methods are shortcuts to set several session-level variables. The ones that impact this library's behavior are:

  • character_set_client determines the encoding that SQL statements sent to the server should have. This includes the SQL strings passed to connection::execute and connection::prepare_statement, and string parameters passed to statement::bind.

    Not all character sets are permissible in character_set_client. The server will accept setting this variable to any UTF-8 character set, but won't accept UTF-16.

  • character_set_results determines the encoding that the server will use to send any kind of result, including string fields retrieved by connection::execute, metadata like metadata::column_name and error messages.

    Note that, when you define a string column with a character set (e.g. "CREATE TABLE t1 (col1 VARCHAR(5) CHARACTER SET latin1)"), the column's character set will be used for storage and comparisons, but not for client communication. If you set character_set_results to utf16, any field obtained by SELECTing col1 will be UTF16-encoded, and not latin1-encoded. Note also that metadata::column_collation reflects the charset and collation the server has converted the column to before sending it to the client. In the above example, metadata::column_collation will be the default collation for UTF16, rather than latin1_swedish_ci.

The table below summarizes the encoding used by each piece of functionality in this library:

Functionality

Encoding given by...

SQL query strings passed to connection::execute and connection::prepare_statement

character_set_client

String values passed as parameters to statement::bind

character_set_client

String fields retrieved by connection::execute or connection::read_some_rows:

field_view::as_string
field_view::get_string

character_set_results

Metadata strings:

metadata::database
metadata::table
metadata::original_table
metadata::column_name
metadata::original_column_name

character_set_results

Server-generated error messages: diagnostics::server_message

character_set_results

Informational messages:

results::info
execution_state::info

ASCII. These can only contain ASCII characters and are always ASCII encoded. More info in this section.

(Experimental) Character set tracking

any_connection attempts to track the connection's current character set. You can access this information using any_connection::current_character_set and any_connection::format_opts.

[Note] Note

This functionality is only relevant when using SQL formatting and escaping functions, like format_sql, format_context or escape_string.

The MySQL protocol has limited support for character set tracking, so this task requires some help from the user. Some situations can make the current character set to be unknown. If this happens, any_connection::current_character_set and any_connection::format_opts return an unknown_character_set error.

This is how tracking works:

[Warning] Warning

Do not execute SET NAMES, SET CHARACTER SET or any other SQL statement that modifies character_set_client using execute. This will make character set information stored in the client invalid.

(Experimental) Adding support for a character set

Built-in support is provided for utf8mb4 (utf8mb4_charset) and ascii (ascii_charset). We strongly encourage you to always use utf8mb4.

If you really need to use a different character set, you can implement them by creating character_set objects. You can then pass them to functions like set_character_set like the built-in ones.

The structure has the following members:

  • character_set::name must match the name you would use in SET NAMES.
  • character_set::next_char is used to iterate the string. It must return the length in bytes of the first code point in the string, or 0 if the code point is invalid.

For example, this is how you could implement the utf8mb4 character set. For brevity, only a small part of the implementation is shown - have a look at the definition of utf8mb4_charset for a full implementation.

// next_char must interpret input as a string encoded according to the
// utf8mb4 character set and return the size of the first character,
// or 0 if the byte sequence does not represent a valid character.
// It must not throw exceptions.
std::size_t utf8mb4_next_char(boost::span<const unsigned char> input)
{
    // Input strings are never empty - they always have 1 byte, at least.
    assert(!input.empty());

    // In UTF8, we need to look at the first byte to know the character's length
    auto first_char = input[0];

    if (first_char < 0x80)
    {
        // 0x00 to 0x7F: ASCII range. The character is 1 byte long
        return 1;
    }
    else if (first_char <= 0xc1)
    {
        // 0x80 to 0xc1: invalid. No UTF8 character starts with such a byte
        return 0;
    }
    else if (first_char <= 0xdf)
    {
        // 0xc2 to 0xdf: two byte characters.
        // It's vital that we check that the characters are valid. Otherwise, vulnerabilities can arise.

        // Check that the string has enough bytes
        if (input.size() < 2u)
            return 0;

        // The second byte must be between 0x80 and 0xbf. Otherwise, the character is invalid
        // Do not skip this check - otherwise escaping will yield invalid results
        if (input[1] < 0x80 || input[1] > 0xbf)
            return 0;

        // Valid, 2 byte character
        return 2;
    }
    // Omitted: 3 and 4 byte long characters
    else
    {
        return 0;
    }
}

PrevUpHomeNext