1.34 (Internationalization) Changes

Introduction

This release is a major upgrade for the Filesystem Library, in preparation for submission to the C++ Standards Committee. Features of this release include:

Rationale for some of the changes is also provided.

Internationalization

Class templates basic_path, basic_filesystem_error, and basic_directory_iterator provide the basic mechanisms for internationalization, in ways very similar to the C++ Standard Library's basic_string and similar class templates. The following typedefs are also provided:

typedef basic_path<std::string, ...> path;
typedef basic_path<std::wstring, ...> wpath;

typedef basic_filesystem_error<path> filesystem_error;
typedef basic_filesystem_error<wpath> wfilesystem_error;

typedef basic_directory_iterator<path> directory_iterator;
typedef basic_directory_iterator<wpath> wdirectory_iterator;

The string type used by Boost.Filesystem basic_path (std::string, std::wstring, or whatever) is called the internal string type. The string type used by the operating system for paths (often char*, sometimes wchar_t*) is called the external string type. Conversion between internal and external types is performed by path traits classes. The specific conversions for path and wpath are implementation defined, with normative encouragement to use the operating system's preferred file system encoding. For many modern POSIX-based file systems the wpath external encoding is UTF-8, while for modern Windows file systems such as NTFS it is UTF-16.
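The internal-to-external conversion described above can be pictured with a minimal sketch. The class name toy_utf8_wpath_traits and its member to_external are hypothetical, chosen only for illustration; they are not the library's traits interface. The sketch assumes a wide internal string, a UTF-8 external encoding, and that each wchar_t holds one complete Unicode code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical traits class for illustration only; not part of Boost.Filesystem.
struct toy_utf8_wpath_traits
{
    typedef std::wstring internal_string_type;  // what the path object stores
    typedef std::string  external_string_type;  // what the operating system receives

    // Encode each wide character as UTF-8.
    // Assumption: every wchar_t is a full code point (no UTF-16 surrogate pairs).
    static external_string_type to_external(const internal_string_type& src)
    {
        external_string_type out;
        for (internal_string_type::size_type i = 0; i < src.size(); ++i)
        {
            unsigned long cp = static_cast<unsigned long>(src[i]);
            if (cp < 0x80)
            {
                out += static_cast<char>(cp);                          // 1 byte
            }
            else if (cp < 0x800)
            {
                out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            else if (cp < 0x10000)
            {
                out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            else
            {
                out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }
};
```

With this sketch, the wide string L"caf\u00e9" converts to the five bytes 63 61 66 C3 A9. A Windows-style traits class would instead pass the wide string through largely unchanged, since the external type there is itself wide.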

The operational functions in operations.hpp are provided with overloads for path, wpath, and user-defined basic_path's. A "do-the-right-thing" rule applies to implementations, ensuring that the correct overload will be chosen.

Simplification of path interface

Prior versions of the library required users of class path to identify the format (native or generic) and name error-checking policy, either via a second constructor argument or via a default mechanism. That approach caused complaints, particularly from users not needing the name checking features. The interface has now been simplified:

Additionally, basic_filesystem_error has been put on a diet and generally simplified.

Error codes have been moved to a separate library, Boost.System.

"//:" has been introduced as a path escape prefix to identify native paths. Rationale: simplifies basic_path constructor interfaces, easier use for platforms needing explicit native format identification.

Rationalization of predicate functions

In discussions and bug reports on the Boost developers mailing list, it became obvious that Boost.Filesystem's exists(), symbolic_link_exists(), and is_directory() predicate functions were poorly specified. There were suggestions to add an is_accessible() function, but Peter Dimov argued that this amounted to papering over the lack of a clear specification and would likely lead to future problems.

Peter suggested that an interesting way to analyze the problem was to ask what the expectations were for true and false values of the various predicates. See the table below.

status()

As part of the predicate discussions, particularly with Rob Stewart, it became obvious that sometimes applications need access to raw status information without any possibility of an exception being thrown. The status() function was added to meet this need. It also proved clearer to specify the semantics of predicate functions in terms of status().

is_file()

About the same time, Jeff Garland suggested that an is_file() predicate would complement is_directory(). In working on the analysis below, it became obvious that the expectations for is_file() were different from the expectations for !is_directory(), so is_file() was added.

is_other()

On some operating systems, it is possible to have a directory entry which is not for either a directory or a file. The is_other() function identifies such cases.

Should predicates throw on errors?

Some conditions reported by operating systems as errors (see footnote) clearly indicate only that the predicate is false, rather than a serious failure. But other errors represent serious hardware or network problems, or permissions problems.

Some people, particularly Rob Stewart, argue that in a function like is_directory(), any error should simply cause the function to return false. If there is actually an underlying problem, it will be detected in due course when a directory_iterator or fstream operation is attempted.

That view was rejected because of the following considerations:

However, the discussion did identify that there are valid cases where non-throwing behavior is a requirement, and a programmer may prefer to deal with file or directory attributes and errors at a very low, bit-mask, level. Function status() was proposed to meet those needs.

Expectations table

In the table below, p is a non-empty path.

Unless otherwise specified, all functions throw on hardware or general failure errors, permission or access errors, symbolic link loop errors, and invalid path errors. If an O/S fails to distinguish between error types, predicate operations return false on such ambiguous errors.

Expectations identify operations that are expected to succeed or fail, assuming no hardware, permission, or access right errors, and no race conditions.

Expression: is_directory(p)
Expectations:
  Returns true if p is found and is a directory, else false.
  If true, then directory_iterator(p) would succeed.
  If false, then directory_iterator(p) would fail.
Semantics:
  Throws: if status() & error_flag
  Returns: status() & directory_flag

Expression: is_file(p)
Expectations:
  Returns true if p is found and is not a directory, else false.
  If true, then ifstream(p) would succeed.
  False, however, does not imply ifstream(p) would fail (because some operating systems allow directories to be opened as files, but stat() does not set the "regular file" flag for them).
Semantics:
  Throws: if status() & error_flag
  Returns: status() & file_flag

Expression: exists(p)
Expectations:
  Returns is_directory(p) || is_file(p) || is_other(p)
Semantics:
  Throws: if status() & error_flag
  Returns: status() & (directory_flag|file_flag|other_flag)

Expression: is_symlink(p)
Expectations:
  Returns true if p is found by shallow (non-transitive) search, and is a symbolic link, else false.
  If true, and p points to q, then for any filesystem function f except those specified as working shallowly on symlinks themselves, f(p) calls f(q), and returns any value returned by f(q).
Semantics:
  Throws: if symlink_status() & error_flag
  Returns: symlink_status() & symlink_flag

Expression: !exists(p) && ((p.has_branch_path() && exists(p.branch_path())) || (!p.has_branch_path() && !p.has_root_path()))
Expectations:
  In other words, if the path does not exist, and (the branch does exist, or (there is no branch and no root)).
  If true, create_directory(p) would succeed.
  If true, ofstream(p) would succeed.

Expression: directory_iterator it(p)
Expectations:
  If it != directory_iterator(), assert(exists(*it)||is_symlink(*it)).
  Note: exists(*it) may throw, and likewise status(*it) may return error_flag. There is no guarantee of accessibility.
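The "Throws" and "Returns" semantics in the table can be sketched as small functions over a status bit mask. The flag names are taken from the table; the numeric values, the status_flags typedef, the *_from function names, and the use of std::runtime_error in place of the library's own exception type are all assumptions made for illustration.

```cpp
#include <cassert>
#include <stdexcept>

// Flag names follow the table above; the values are assumed, not the library's.
enum status_flag
{
    error_flag     = 1,
    directory_flag = 2,
    file_flag      = 4,
    other_flag     = 8,
    symlink_flag   = 16
};

typedef unsigned status_flags;  // hypothetical bit-mask type

// Shared step: "Throws: if status() & error_flag".
// std::runtime_error stands in for the library's filesystem error type.
inline void throw_if_error(status_flags st)
{
    if (st & error_flag)
        throw std::runtime_error("status query reported an error");
}

// "Returns: status() & directory_flag"
inline bool is_directory_from(status_flags st)
{
    throw_if_error(st);
    return (st & directory_flag) != 0;
}

// "Returns: status() & file_flag"
inline bool is_file_from(status_flags st)
{
    throw_if_error(st);
    return (st & file_flag) != 0;
}

// "Returns: status() & (directory_flag|file_flag|other_flag)"
inline bool exists_from(status_flags st)
{
    throw_if_error(st);
    return (st & (directory_flag | file_flag | other_flag)) != 0;
}
```

A caller that must never see an exception would inspect the mask directly, for example by testing error_flag itself, which is the low-level use that status() was added to support.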

Conclusion

Predicate operations is_directory(), is_file(), is_symlink(), and exists() with the indicated semantics form a self-consistent set that meets expectations.

Preservation of existing user code

Although the change to a template based approach required a complete overhaul of the implementation code, the interface as used by existing applications is mostly unchanged. Conversion problems which would otherwise affect user code have been reduced by providing deprecated functions to ease transition. The deprecated functions are:

// class basic_path - 2nd constructor argument ignored:
basic_path( const string_type & str, name_check );
basic_path( const typename string_type::value_type * s, name_check );

// class basic_path - old names provided for renamed functions:
string_type native_file_string() const;
string_type native_directory_string() const;

// class basic_path - now defined such that these no longer have any real effect:
static bool default_name_check_writable() { return false; } 
static void default_name_check( name_check ) {}
static name_check default_name_check() { return 0; }

// non-deducible operations functions assume class path
inline path current_path();
inline const path & initial_path();

// the new basic_directory_entry provides leaf()
// to cover the common existing use case itr->leaf()
typename Path::string_type leaf() const;

If you do not want the deprecated functions to be included, define the macro BOOST_FILESYSTEM_NO_DEPRECATED.

The greatest impact on existing code is the change of directory iterator value type from path to directory_entry. To ease the most common directory iterator use case, basic_directory_entry provides an automatic conversion to basic_path, and this also serves to prevent breakage of a lot of existing code. See the next section for discussion of rationale.

// the new basic_directory_entry provides:
operator const path_type &() const;

More efficient operations when iterating over directories

Several common real-world operating systems (BSD derivatives, Linux, Windows) provide status information during directory iteration. Caching of this status information results in three to six times faster operation for typical predicate operations. (For a directory containing 15,047 files, iteration took 1 second with caching vs 6 seconds without on a freshly booted system, and 0.3 seconds vs 0.9 seconds after prior use of the directory.)

The efficiency gains from caching such status information were considered too significant to ignore. Because the possibility of race-conditions differs depending on whether the cached information is used or an actual system call is performed, it was considered necessary to provide explicit functions utilizing the cached information, rather than implicitly using the cache behind the scenes.

Three options were explored for exposing the cached status information, with full implementations of each. After initial implementation of option 1 exposed the problems noted below, option 2 was tested as a possible engineering tradeoff. Option 3 was finally chosen as the cleanest design.

For each option: how the cache is accessed, then pros and cons.

Option 1: Predicate function overloads
(basic_directory_iterator value_type is path)
  • Very questionable design (friendship abuse, overload abuse, etc.)
  • User cannot reuse cache
  • Readability problem; easy to miss difference between f(*it) and f(it)
  • Write-ability problem (error prone?)
  • Most common iterator use is brief: *it
  • Preserves existing code

Option 2: Predicate member functions of basic_directory_iterator
(basic_directory_iterator value_type is path)
  • Somewhat cleaner design (although adding iterator functions is unusual)
  • User cannot reuse cache
  • Readability and write-ability are OK: f(*it) and it.f() sufficiently different
  • Most common iterator use is brief: *it
  • Preserves existing code

Option 3: Predicate member functions of basic_directory_entry
(basic_directory_iterator value_type is basic_directory_entry)
  • Cleanest design
  • User can reuse cache
  • Readability and write-ability are OK: f(*it) and it->f() sufficiently different
  • Most common iterator use is longer: it->path(), but by providing "operator const basic_path &" it is still possible to write a bare *it
  • Breaks some existing code. The "operator const basic_path &" conversion eliminates breakage of the most common use case, while providing a (deprecated) leaf() prevents breakage of the second most common use case.
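The shape of option 3 can be sketched with a toy entry type. Everything here (toy_path, toy_directory_entry, path_length) is a hypothetical stand-in rather than the library's basic_directory_entry; the sketch only shows how a cached predicate, a conversion operator to the path type, and a leaf() member can coexist so that code written against a path value keeps compiling.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for illustration; not the library's types.
typedef std::string toy_path;

class toy_directory_entry
{
public:
    toy_directory_entry(const toy_path& p, bool is_dir)
        : path_(p), cached_is_directory_(is_dir) {}

    const toy_path& path() const { return path_; }

    // Conversion that lets an entry be passed wherever a path is expected.
    operator const toy_path&() const { return path_; }

    // Answered from the value cached during iteration: no further system
    // call is made, so the answer can be stale if the file system changed.
    bool is_directory() const { return cached_is_directory_; }

    // Convenience for the old itr->leaf() idiom: final element of the path.
    std::string leaf() const
    {
        std::string::size_type pos = path_.rfind('/');
        return pos == std::string::npos ? path_ : path_.substr(pos + 1);
    }

private:
    toy_path path_;
    bool cached_is_directory_;
};

// Existing-style code written against a path argument.
inline std::string::size_type path_length(const toy_path& p) { return p.size(); }
```

Given an entry e built from "docs/index.html", the call path_length(e) compiles through the conversion operator, e.leaf() gives "index.html", and e.is_directory() reads the cached value. Making the cached query a member of the entry, rather than an overload of the free predicate, is what keeps the race-condition difference visible at the call site.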

Rationale

Elimination of the native versus generic distinction

Elimination of user confusion and general design simplification was the original motivation for elimination of the distinction between native and generic paths.

During design work, a further technical argument was discovered. Consider the path "c:foo/bar". On many POSIX systems, "c:foo" is a valid directory name, so we have a two element path and there is no issue of native versus generic format. On Windows systems, however, "c:" is a drive specification, so we have a three element path. All calls to the operating system will result in "c:" being considered a drive specification; there is no way that fact-of-life can be changed by claiming the format is generic. The native versus generic distinction is thus useless and misleading for POSIX, Windows, and probably most other operating systems.
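The element counts in the "c:foo/bar" example can be reproduced with a small sketch. The function split_elements and its windows_rules switch are hypothetical and deliberately simplified (only '/' is treated as a separator, and a drive is recognized as a single letter followed by a colon); this is not how the library parses paths.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper for illustration; not a library function.
inline std::vector<std::string> split_elements(const std::string& s, bool windows_rules)
{
    std::vector<std::string> elements;
    std::string::size_type start = 0;

    // Under Windows rules a leading "<letter>:" is a drive specification
    // and forms an element of its own.
    if (windows_rules && s.size() >= 2
        && std::isalpha(static_cast<unsigned char>(s[0])) && s[1] == ':')
    {
        elements.push_back(s.substr(0, 2));
        start = 2;
    }

    // Split the remainder on '/', skipping empty pieces.
    while (start < s.size())
    {
        std::string::size_type slash = s.find('/', start);
        if (slash == std::string::npos)
            slash = s.size();
        if (slash > start)
            elements.push_back(s.substr(start, slash - start));
        start = slash + 1;
    }
    return elements;
}
```

Called with windows_rules set to false, "c:foo/bar" yields the two elements "c:foo" and "bar"; with it set to true, the same string yields "c:", "foo", and "bar". The interpretation follows the platform rules, not any native or generic label attached to the string.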

If paths for a particular operating system did require a distinction be made, it could be done by requiring that native paths be prefixed with some unique implementation-defined identification. For example, "native-path:". This would only be required for operating systems where (1) the distinction mattered, and (2) there was no lexical way to distinguish the two forms. For example, a native operating system that used the same syntax as the Filesystem Library's generic POSIX-like format, but processed the elements right-to-left instead of left-to-right.

Preservation of existing code

Allowing existing user code to continue to work with the updated version of the library has obvious benefits in terms of preserving the effort users have applied to both learning the library and writing code which uses the library.

There is an additional motivation: other than the name checking portion of class path, the existing interface has proven to be useful and robust, so there is no reason to fiddle with it.

Single path design

During preliminary internationalization discussion on the Boost developer's list, a design was considered for a single path class which could hold either narrow or wide character based paths. That design was rejected because:

No versions of status() which throw exceptions on errors

The rationale for not including versions of status() which throw exceptions on errors is that (1) the primary purpose of this function is to perform queries at a very low-level, where exceptions are usually unwanted, and (2) exceptions on errors are already provided by the predicate functions. There would be little or no efficiency gain from providing a throwing version of status().

Symlink identifying version of status() function

A symlink identifying version of the status() function is distinguished by a second argument. Often separately named functions are more appropriate than overloading when behavior differs, which is the case here, while overloads are more appropriate when behavior is the same but argument types differ (Iain Hanson). Overloading was chosen in this particular case because of a subjective judgment that a single function name with an optional "symlink" second argument produced more understandable code. The original implementation of the function used the name "symlink_status", but that just didn't read right in real code.

POSIX wpath_traits defaults to locale(""), but allows imbuing of locale

Vladimir Prus pointed out that for Linux (and presumably other POSIX operating systems), which need to convert wide character paths to narrow characters, the default conversion should depend not on the operating system alone, but on the std::locale("") default. For example, the usual encoding for Russian on Linux (and Russian web sites) is KOI8-R (RFC 1489). The ability to safely specify a different locale is also provided, to meet unforeseen needs.


Revised 18 March, 2008

© Copyright Beman Dawes, 2005

Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at www.boost.org/LICENSE_1_0.txt)