Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards

PrevUpHomeNext

Character sets and collations

According to MySQL docs, a character set is a set of symbols and their respective encodings. utf8mb4, utf16 and ascii are character sets supported by MySQL. A collation is a set of rules to compare characters, and is associated to a single character set. For example, utf8mb4_spanish_ci compares utf8mb4 characters in a case-insensitive way.

The connection character set and collation

Every client session has an associated character set and collation. The connection's character set determines the encoding for character strings sent to and retrieved from the server. This includes SQL query strings, string fields and column names in metadata. The connection's collation is used for string literal comparison. The connection's character set and collation can be changed dynamically using SQL.

By default, Boost.MySQL connections use utf8mb4_general_ci, thus using UTF-8 for all strings. We recommend using this default, as MySQL character sets are easy to get wrong.

The connection's character set is not linked to the character set specified for databases, tables and columns. Consider the following declaration:

CREATE TABLE test_table(
    col1 TEXT CHARACTER SET utf16 COLLATE utf16_spanish_ci
);

Data stored in col1 will be encoded using UTF-16 and use utf16_spanish_ci for comparisons. However, when sent to the client, it will be converted to the connection's character set.

[Note] Note

utf8mb4 is how MySQL calls regular UTF-8. Confusingly, MySQL has a character set named utf8 which is not UTF-8 compliant.

Connection character set effects

The connection's character set is crucial because it affects the encoding of most string fields. The following is a summary of what's affected:

To sum up, to properly use a connection, it's crucial to know the character set it's using.

Character set recommendations

The following sections provide a deep explanation on how character sets work in MySQL. If you don't have the time to read them, stick to the following advice:

Character set tracking

There is a number of actions that can change the connection's character set:

any_connection attempts to track the connection's current character set because it's required to securely perform client-side SQL formatting. This info is available using any_connection::current_character_set, which returns a character_set object. The current character set is also used by async_execute when a with_params_t object is passed, and by any_connection::format_opts.

The MySQL protocol has limited support for character set tracking, so this task requires some help from the user. Some situations can make the current character set to be unknown. If this happens, executing a with_params_t fails with client_errc::unknown_character_set. any_connection::current_character_set and any_connection::format_opts also return this error.

Following the above points, this is how tracking works:

[Warning] Warning

Do not execute SET NAMES, SET CHARACTER SET or any other SQL statement that modifies character_set_client using async_execute. This will make character set information stored in the client invalid.

Adding support for a character set

Built-in support is provided for utf8mb4 (utf8mb4_charset) and ascii (ascii_charset). We strongly encourage you to always use utf8mb4. Note that MySQL doesn't support setting the connection's character set to UTF-16 or UTF-32.

If you really need to use a different character set, you can implement them by creating character_set objects. You can then pass them to functions like set_character_set like the built-in ones.

[Note] Note

This is an advanced technique. Don't use it unless you know what you are doing.

The structure has the following members:

For example, this is how you could implement the utf8mb4 character set. For brevity, only a small part of the implementation is shown - have a look at the definition of utf8mb4_charset for a full implementation.

// next_char must interpret input as a string encoded according to the
// utf8mb4 character set and return the size of the first character,
// or 0 if the byte sequence does not represent a valid character.
// It must not throw exceptions.
std::size_t utf8mb4_next_char(boost::span<const unsigned char> input)
{
    // Input strings are never empty - they always have 1 byte, at least.
    assert(!input.empty());

    // In UTF8, we need to look at the first byte to know the character's length
    auto first_char = input[0];

    if (first_char < 0x80)
    {
        // 0x00 to 0x7F: ASCII range. The character is 1 byte long
        return 1;
    }
    else if (first_char <= 0xc1)
    {
        // 0x80 to 0xc1: invalid. No UTF8 character starts with such a byte
        return 0;
    }
    else if (first_char <= 0xdf)
    {
        // 0xc2 to 0xdf: two byte characters.
        // It's vital that we check that the characters are valid. Otherwise, vulnerabilities can arise.

        // Check that the string has enough bytes
        if (input.size() < 2u)
            return 0;

        // The second byte must be between 0x80 and 0xbf. Otherwise, the character is invalid
        // Do not skip this check - otherwise escaping will yield invalid results
        if (input[1] < 0x80 || input[1] > 0xbf)
            return 0;

        // Valid, 2 byte character
        return 2;
    }
    // Omitted: 3 and 4 byte long characters
    else
    {
        return 0;
    }
}

character_set_results and character_set_client

Setting the connection's character set during connection establishment or using async_set_character_set has the ultimate effect of changing some session variables. This section lists them as a reference. We strongly encourage you not to modify them manually, as this will confuse character set tracking.

The table below summarizes the encoding used by each piece of functionality in this library:

Functionality

Encoding given by...

SQL query strings passed to async_execute and async_prepare_statement

character_set_client

Strings used with with_params and format_sql

character_set_client

String values passed as parameters to statement::bind

character_set_client

String fields in rows retrieved from the server

character_set_results

Metadata strings:

metadata::database
metadata::table
metadata::original_table
metadata::column_name
metadata::original_column_name

character_set_results

Server-generated error messages: diagnostics::server_message

character_set_results

Informational messages:

results::info
execution_state::info

ASCII. These can only contain ASCII characters and are always ASCII encoded. More info in this section.


PrevUpHomeNext