Normalization allows us to determine whether two URLs refer to the same resource. URL comparisons serve the same purpose: two strings are compared as if they were normalized.
There is no way to determine whether two URLs refer to the same resource without full knowledge or control of them. Thus, equivalence is based on string comparisons augmented by additional URL and scheme rules. This means comparison alone cannot establish that two URLs identify different resources, because the same resource can always be served from different addresses.
For this reason, comparison methods are designed to minimize false negatives while strictly avoiding false positives. In other words, if two URLs compare equal, they definitely represent the same resource. If they are considered different, they might still refer to the same resource depending on the application.
Context-dependent rules can be considered to minimize the number of false negatives, where cheaper methods have a higher chance of producing false negatives:
Simple String Comparison can be performed by accessing the underlying buffer of URLs:
url_view u1("https://www.boost.org/index.html");
url_view u2("https://www.boost.org/doc/../index.html");
assert(u1.buffer() != u2.buffer());
The rules of rfc3986 alone are enough to show that the URLs above point to the same resource, yet Simple String Comparison fails to identify this. The comparison operators instead compare URLs as if Syntax-Based Normalization, as defined by rfc3986, had been applied to both.
url_view u1("https://www.boost.org/index.html");
url_view u2("https://www.boost.org/doc/../index.html");
assert(u1 == u2);
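To illustrate why the two paths above are considered equal, here is a minimal standard-library sketch of the "remove dot segments" procedure from rfc3986 section 5.2.4. The names `remove_dot_segments` and `pop_last_segment` are hypothetical helpers written for this illustration; they are not part of the library, and this is not necessarily how the library implements its comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: drop the last segment of `out`,
// together with its preceding "/" if there is one.
inline void pop_last_segment(std::string& out)
{
    std::size_t pos = out.rfind('/');
    if (pos == std::string::npos)
        out.clear();
    else
        out.erase(pos);
}

// Hypothetical helper: a sketch of rfc3986 section 5.2.4, steps A to E.
inline std::string remove_dot_segments(std::string_view in)
{
    std::string out;
    while (!in.empty())
    {
        if (in.substr(0, 3) == "../")          // A: leading "../"
            in.remove_prefix(3);
        else if (in.substr(0, 2) == "./")      // A: leading "./"
            in.remove_prefix(2);
        else if (in.substr(0, 3) == "/./")     // B: "/./" becomes "/"
            in.remove_prefix(2);
        else if (in == "/.")                   // B: trailing "/." becomes "/"
            in = "/";
        else if (in.substr(0, 4) == "/../")    // C: "/../" becomes "/", pop a segment
        {
            in.remove_prefix(3);
            pop_last_segment(out);
        }
        else if (in == "/..")                  // C: trailing "/.." becomes "/", pop a segment
        {
            in = "/";
            pop_last_segment(out);
        }
        else if (in == "." || in == "..")      // D: lone "." or ".."
            in = std::string_view{};
        else                                   // E: move the first segment to the output
        {
            std::size_t n = std::min(in.find('/', 1), in.size());
            out.append(in.substr(0, n));
            in.remove_prefix(n);
        }
    }
    return out;
}
```

Applied to the path of the second URL, `remove_dot_segments("/doc/../index.html")` yields `"/index.html"`, which matches the path of the first URL.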
In mutable URLs, the member function normalize
can be used to apply Syntax-Based Normalization to a URL. A normalized URL
is represented by a canonical string, so any two URLs that would compare
equal end up with the same underlying representation. In other words, Simple
String Comparison of two normalized URLs gives the same result as comparing
them under Syntax-Based Normalization.
url_view u1("https://www.boost.org/index.html");
url u2("https://www.boost.org/doc/../index.html");
assert(u1.buffer() != u2.buffer());
assert(u1 == u2);
u2.normalize();
assert(u1.buffer() == u2.buffer());
assert(u1 == u2);
Normalization uses the following definitions of rfc3986 to minimize false negatives:
The following example normalizes the percent-encoding and path segments of a URL:
url u("https://www.boost.org/doc/../%69%6e%64%65%78%20file.html");
u.normalize();
assert(u.buffer() == "https://www.boost.org/index%20file.html");
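The percent-encoding part of the example above can be sketched with the standard library alone. The snippet below follows the rule in rfc3986 section 6.2.2.2: percent-encoded octets that correspond to unreserved characters are decoded, and the remaining escapes keep their encoding with uppercase hexadecimal digits. The names `is_unreserved`, `hex_value`, and `normalize_pct_encoding` are hypothetical helpers for this illustration; the library's own implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
inline bool is_unreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical helper: value of a hex digit, or -1 if `c` is not one.
inline int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decode escapes of unreserved characters and
// uppercase the hex digits of every other valid escape.
inline std::string normalize_pct_encoding(std::string_view in)
{
    static constexpr char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '%' && i + 2 < in.size())
        {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                unsigned char c = static_cast<unsigned char>(hi * 16 + lo);
                if (is_unreserved(c))
                {
                    out.push_back(static_cast<char>(c)); // safe to decode
                }
                else
                {
                    out.push_back('%');                  // keep the escape,
                    out.push_back(digits[hi]);           // with uppercase hex
                    out.push_back(digits[lo]);
                }
                i += 2; // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(in[i]); // ordinary character or malformed escape
    }
    return out;
}
```

For the file name used above, `normalize_pct_encoding("%69%6e%64%65%78%20file.html")` gives `"index%20file.html"`: the letters are unreserved and get decoded, while the space must remain encoded as `%20`.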
Syntax-Based Normalization can also be used as a first step for Scheme-Based and Protocol-Based Normalization. One common scheme-specific rule is ignoring the default port for that scheme and empty absolute paths:
auto normalize_http_url = [](url& u)
{
    u.normalize();
    // Port 80 is the default for http, so an explicit "80" or an empty port is redundant
    if (u.port() == "80" || u.port().empty())
        u.remove_port();
    // With an authority present, an empty path is equivalent to "/"
    if (u.has_authority() && u.encoded_path().empty())
        u.set_path_absolute(true);
};

url u1("http://www.boost.org");
normalize_http_url(u1);
url u2("http://www.boost.org/");
normalize_http_url(u2);
url u3("http://www.boost.org:/");
normalize_http_url(u3);
url u4("http://www.boost.org:80/");
normalize_http_url(u4);

assert(u1.buffer() == "http://www.boost.org/");
assert(u2.buffer() == "http://www.boost.org/");
assert(u3.buffer() == "http://www.boost.org/");
assert(u4.buffer() == "http://www.boost.org/");
Other criteria commonly used to minimize false negatives for specific schemes are: