...one of the most highly
regarded and expertly designed C++ library projects in the
world.
— Herb Sutter and Andrei
Alexandrescu, C++
Coding Standards
MySQL defines a character set as "a set of symbols and their respective
encodings". utf8mb4
,
utf16
and ascii
are character sets supported by MySQL.
A collation is a set of rules for comparing characters in a character set.
For example, a case-insensitive collation will make strings that only differ
in case compare equal. All collations are associated to a single character
set. For example, utf8mb4_spanish_ci
is a case-insensitive collation associated to the utf8mb4
character set. Every character set has a default collation, which will be used
if a character set without a collation is specified. For example, latin1_swedish_ci
is the default collation
for the latin1
character set.
You can find more information about these concepts in the official MySQL docs on character sets.
Every connection has an associated character set and collation. The connection's character set determines the encoding for character strings sent to and retrieved from the server. This includes SQL query strings, string fields and column names in metadata. The connection's collation is used for string literal comparison.
Every session you establish can have its own different character set and collation. You can specify this in two ways:
When calling connection::connect
or connection::handshake
,
using handshake_params::connection_collation
.
You specify a numeric ID that identifies the collation to use, and your
connection will use the character set associated to this collation. You
can find collation IDs in the <boost/mysql/mysql_collations.hpp>
and <boost/mysql/mariadb_collations.hpp>
headers.
The problem with this approach is that if you specify a collation ID that
is unknown to the server (e.g. utf8mb4_0900_ai_ci
for an old MySQL 5.7 server), the handshake operation will succeed but
the connection will silently fall back to the server's
default character set, (usually latin1
,
which is not Unicode).
SET NAMES
SQL statement. For example, "SET NAMES
utf8mb4"
will set the current connection's character
set to utf8mb4
and the
connection's collation to utf8mb4's default collation. If the character
set is unknown, the SET NAMES
statement will fail. You can use
connection::execute
to issue the statement:
results result; conn.execute("SET NAMES utf8mb4", result); // Further operations can assume utf8mb4 as conn's charset
Both of the above methods are shortcuts to set several session-level variables. The ones that impact this library's behavior are:
character_set_client
determines the encoding that SQL statements sent to the server should have.
This includes the SQL strings passed to connection::execute
and connection::prepare_statement
,
and string parameters passed to statement::bind
.
Not all character sets are permissible in character_set_client
.
The server will accept setting this variable to any UTF-8 character set,
but won't accept UTF-16.
character_set_results
determines the encoding that the server will use to send any kind of result,
including string fields retrieved by connection::execute
,
metadata like metadata::column_name
and error messages.
Note that, when you define a string column with a character set (e.g.
"CREATE TABLE t1 (col1 VARCHAR(5) CHARACTER
SET latin1)"
), the column's character set will be
used for storage and comparisons, but not for client communication. If
you set character_set_results
to utf16
, any field obtained
by SELECT
ing col1
will be UTF16-encoded, and not latin1-encoded.
Note also that metadata::column_collation
reflects the charset and collation the server has converted the column
to before sending it to the client. In the above example, metadata::column_collation
will be the default
collation for UTF16, rather than latin1_swedish_ci
.
The table below summarizes the encoding used by each piece of functionality in this library:
Functionality |
Encoding given by... |
---|---|
SQL query strings passed to |
|
String values passed as parameters to |
|
String fields retrieved by |
|
Metadata strings: |
|
Server-generated error messages: |
|
Informational messages: |
ASCII. These can only contain ASCII characters and are always ASCII encoded. More info in this section. |
any_connection
attempts to track the connection's current character set. You can access this
information using any_connection::current_character_set
and any_connection::format_opts
.
Note | |
---|---|
This functionality is only relevant when using SQL formatting and escaping
functions, like |
The MySQL protocol has limited support for character set tracking, so this
task requires some help from the user. Some situations can make the current
character set to be unknown. If this happens, any_connection::current_character_set
and any_connection::format_opts
return an unknown_character_set
error.
This is how tracking works:
any_connection::connect
or async_connect
succeed, heuristics are used to determine the current character set. This
is required because the server may reject the collation requested by connect_params::connection_collation
and silently fall back to an unknown character set. If Boost.MySQL is not
sure that the collation will be accepted, the current character set will
be left unknown. Note that leaving connect_params::connection_collation
to its default value always sets the current character set to utf8mb4_charset
.
any_connection::set_character_set
or async_set_character_set
to set the current character set to a known value. This will issue a SET NAMES
statement and also update the value stored in the client.
any_connection::reset_connection
or async_reset_connection
resets the character set to the server's default, which is unknown (usually
latin1
). The current character
set will be unknown until you call set_character_set
or async_set_character_set
.
Warning | |
---|---|
Do not execute |
Built-in support is provided for utf8mb4
(utf8mb4_charset
)
and ascii
(ascii_charset
).
We strongly encourage you to always use utf8mb4
.
If you really need to use a different character set, you can implement them
by creating character_set
objects. You can then pass them to functions like set_character_set
like the built-in ones.
The structure has the following members:
character_set::name
must match the name you would use in SET
NAMES
.
character_set::next_char
is used to iterate the string. It must return the length in bytes of the
first code point in the string, or 0 if the code point is invalid.
For example, this is how you could implement the utf8mb4
character set. For brevity, only a small part of the implementation is shown
- have a look at the definition of utf8mb4_charset
for a full implementation.
// next_char must interpret input as a string encoded according to the // utf8mb4 character set and return the size of the first character, // or 0 if the byte sequence does not represent a valid character. // It must not throw exceptions. std::size_t utf8mb4_next_char(boost::span<const unsigned char> input) { // Input strings are never empty - they always have 1 byte, at least. assert(!input.empty()); // In UTF8, we need to look at the first byte to know the character's length auto first_char = input[0]; if (first_char < 0x80) { // 0x00 to 0x7F: ASCII range. The character is 1 byte long return 1; } else if (first_char <= 0xc1) { // 0x80 to 0xc1: invalid. No UTF8 character starts with such a byte return 0; } else if (first_char <= 0xdf) { // 0xc2 to 0xdf: two byte characters. // It's vital that we check that the characters are valid. Otherwise, vulnerabilities can arise. // Check that the string has enough bytes if (input.size() < 2u) return 0; // The second byte must be between 0x80 and 0xbf. Otherwise, the character is invalid // Do not skip this check - otherwise escaping will yield invalid results if (input[1] < 0x80 || input[1] > 0xbf) return 0; // Valid, 2 byte character return 2; } // Omitted: 3 and 4 byte long characters else { return 0; } }