...one of the most highly
regarded and expertly designed C++ library projects in the
world.
— Herb Sutter and Andrei
Alexandrescu, C++
Coding Standards
According to MySQL
docs, a character set is a
set of symbols and their respective encodings. utf8mb4
,
utf16
and ascii
are character sets supported by MySQL. A collation
is a set of rules to compare characters, and is associated to a single character
set. For example, utf8mb4_spanish_ci
compares utf8mb4
characters
in a case-insensitive way.
Every client session has an associated character set and collation. The connection's character set determines the encoding for character strings sent to and retrieved from the server. This includes SQL query strings, string fields and column names in metadata. The connection's collation is used for string literal comparison. The connection's character set and collation can be changed dynamically using SQL.
By default, Boost.MySQL connections use utf8mb4_general_ci
,
thus using UTF-8 for all strings. We recommend
using this default, as MySQL character sets are easy to get wrong.
The connection's character set is not linked to the character set specified for databases, tables and columns. Consider the following declaration:
CREATE TABLE test_table( col1 TEXT CHARACTER SET utf16 COLLATE utf16_spanish_ci );
Data stored in col1
will be
encoded using UTF-16 and use utf16_spanish_ci
for comparisons. However, when sent to the client, it
will be converted to the connection's character set.
![]() |
Note |
---|---|
|
The connection's character set is crucial because it affects the encoding of most string fields. The following is a summary of what's affected:
async_execute
and async_prepare_statement
must be sent using the connection's character set. Otherwise, server-side
parsing errors may happen.
with_params
and format_sql
must be encoded using the connection's character set. Otherwise, values
will be rejected by Boost.MySQL when composing the query. Connections
track the character set in use
to detect these errors. If you bypass character set tracking (e.g. by using
SET NAMES
instead of async_set_character_set
),
you may run into vulnerabilities.
statement::bind
should use the connection's character set. Otherwise, MySQL may reject
the values.
diagnostics::server_message
)
also use the connection's character set.
To sum up, to properly use a connection, it's crucial to know the character set it's using.
The following sections provide a deep explanation on how character sets work in MySQL. If you don't have the time to read them, stick to the following advice:
connect_params
ensure that UTF-8 is used, without the need to run any SQL.
character_set_client
and character_set_results
session variables using async_execute
.
This breaks character set tracking, which can lead to vulnerabilities.
async_reset_connection
unless you know what you're doing. If you need to reuse connections, use
connection_pool
,
instead.
connection_pool
always use utf8mb4
. When
connections are returned to the pool, their character set is reset to
utf8mb4
.
There is a number of actions that can change the connection's character set:
async_connect
,
a numeric collation ID is supplied to the server. You can change it using
connect_params::connection_collation
.
The <boost/mysql/mysql_collations.hpp>
and <boost/mysql/mariadb_collations.hpp>
headers contain available collation IDs. If the server recognizes the passed
collation, the connection's character set will be the one associated to
the collation. If it doesn't, the connection will
silently fall back to the server's default character set (usually
latin1
, which is not Unicode).
This can happen when trying to use a newer collation, like utf8mb4_0900_ai_ci
, with an old MySQL
5.7 server. By default, Boost.MySQL uses utf8mb4_general_ci
,
supported by all servers.
async_reset_connection
resets the connection's character set to the server's
default character set.
async_set_character_set
executes a SET NAMES
statement to set the connection's character set. Executing a pipeline with
a set character set stage has the same results.
SET NAMES
, SET
CHARACTER SET
or modifying the character_set_client
and character_set_results
change the connection's character set. Don't do this,
as it will confuse character set tracking.
any_connection
attempts to track the connection's current character set because it's required
to securely perform client-side SQL formatting. This info is available using
any_connection::current_character_set
,
which returns a character_set
object. The current character set is also used by async_execute
when a with_params_t
object is passed, and by any_connection::format_opts
.
The MySQL protocol has limited support for character set tracking, so this
task requires some help from the user. Some situations can make the current
character set to be unknown. If this happens, executing a with_params_t
fails with client_errc::unknown_character_set
. any_connection::current_character_set
and any_connection::format_opts
also return this error.
Following the above points, this is how tracking works:
async_connect
succeeds, conservative heuristics are used to determine the current character
set. If the passed connect_params::connection_collation
is known to be accepted by all supported servers, its associated character
set becomes the current one. If the library is not sure, the current character
set is left unknown (this is the safe choice to avoid vulnerabilities).
Note that leaving connection_collation
to its default value always sets the current character set to utf8mb4_charset
.
async_set_character_set
sets the current character set to the passed one. The same applies for
a successful set character set pipeline stage.
async_reset_connection
makes the current character set unknown.
![]() |
Warning |
---|---|
Do not execute |
Built-in support is provided for utf8mb4
(utf8mb4_charset
)
and ascii
(ascii_charset
).
We strongly encourage you to always use utf8mb4
.
Note that MySQL doesn't support setting the connection's character set to UTF-16
or UTF-32.
If you really need to use a different character set, you can implement them
by creating character_set
objects. You can then pass them to functions like set_character_set
like the built-in ones.
![]() |
Note |
---|---|
This is an advanced technique. Don't use it unless you know what you are doing. |
The structure has the following members:
character_set::name
must match the name you would use in SET
NAMES
.
character_set::next_char
is used to iterate the string. It must return the length in bytes of the
first code point in the string, or 0 if the code point is invalid.
For example, this is how you could implement the utf8mb4
character set. For brevity, only a small part of the implementation is shown
- have a look at the definition of utf8mb4_charset
for a full implementation.
// next_char must interpret input as a string encoded according to the // utf8mb4 character set and return the size of the first character, // or 0 if the byte sequence does not represent a valid character. // It must not throw exceptions. std::size_t utf8mb4_next_char(boost::span<const unsigned char> input) { // Input strings are never empty - they always have 1 byte, at least. assert(!input.empty()); // In UTF8, we need to look at the first byte to know the character's length auto first_char = input[0]; if (first_char < 0x80) { // 0x00 to 0x7F: ASCII range. The character is 1 byte long return 1; } else if (first_char <= 0xc1) { // 0x80 to 0xc1: invalid. No UTF8 character starts with such a byte return 0; } else if (first_char <= 0xdf) { // 0xc2 to 0xdf: two byte characters. // It's vital that we check that the characters are valid. Otherwise, vulnerabilities can arise. // Check that the string has enough bytes if (input.size() < 2u) return 0; // The second byte must be between 0x80 and 0xbf. Otherwise, the character is invalid // Do not skip this check - otherwise escaping will yield invalid results if (input[1] < 0x80 || input[1] > 0xbf) return 0; // Valid, 2 byte character return 2; } // Omitted: 3 and 4 byte long characters else { return 0; } }
Setting the connection's character set during connection establishment or using
async_set_character_set
has the ultimate effect of changing some session variables. This section lists
them as a reference. We strongly encourage you not to
modify them manually, as this will confuse character set tracking.
async_execute
and async_prepare_statement
,
and string parameters passed to statement::bind
.
Not all character sets are permissible in character_set_client
.
For example, UTF-16 and UTF-32 based character sets won't be accepted.
connection::execute
,
metadata like metadata::column_name
and error messages. Note that metadata::column_collation
reflects the character set and collation the server has converted the column
to before sending it to the client. In the above example, metadata::column_collation
will be the default
collation for UTF16, rather than latin1_swedish_ci
.
The table below summarizes the encoding used by each piece of functionality in this library:
Functionality |
Encoding given by... |
---|---|
SQL query strings passed to |
|
Strings used with |
|
String values passed as parameters to |
|
String fields in rows retrieved from the server |
|
Metadata strings: |
|
Server-generated error messages: |
|
Informational messages: |
ASCII. These can only contain ASCII characters and are always ASCII encoded. More info in this section. |