Title
Transcoding by std::formatter<std::filesystem::path>
Status
new
Section
[fs.path.fmtr.funcs]
Submitter
Jonathan Wakely

Created on 2024-04-19.00:00:00 last changed 2 weeks ago

Messages

Date: 2024-04-19.16:37:22

Proposed resolution:

This wording is relative to N4981.

  1. Modify [fs.path.fmtr.funcs] as indicated:

    
    template<class FormatContext>
      typename FormatContext::iterator
        format(const filesystem::path& p, FormatContext& ctx) const;
    
    -5- Effects: Let `s` be `p.generic_string<filesystem::path::value_type>()` if the `g` option is used, otherwise `p.native()`. Writes `s` into `ctx.out()`, adjusted according to the path-format-spec. If `charT` is `char`, `path::value_type` is `wchar_t`, and the literal encoding is UTF-8, then the (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If `charT` and `path::value_type` are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
  2. Modify the entry in the index of implementation-defined behavior as indicated:
    transcoding of a formatted `path` when `charT` and `path::value_type` differ and not converting from `wchar_t` to UTF-8

Date: 2024-04-19.00:00:00

[fs.path.fmtr.funcs] says:

If `charT` is `char`, `path::value_type` is `wchar_t`, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.

This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the `?` option is used. Otherwise, the form of transcoding is completely implementation-defined. However, this makes no sense. An escaped string will have no ill-formed subsequences, because they will already have been replaced as per [format.string.escaped]:

Otherwise (X is a sequence of ill-formed code units), each code unit U is appended to E in order as the sequence \x{hex-digit-sequence}, where hex-digit-sequence is the shortest hexadecimal representation of U using lower-case hexadecimal digits.

So only unescaped strings can have ill-formed sequences by the time we do transcoding to `char`, but whether or not any U+FFFD substitution occurs is just implementation-defined.
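
To make the substitution under discussion concrete, here is a minimal sketch of transcoding with U+FFFD substitution. It is not taken from any implementation. It assumes a 16-bit wide encoding (UTF-16, with `char16_t` standing in for `wchar_t`), and it assumes that each unpaired surrogate is treated as its own ill-formed subsequence. The names `append_utf8` and `utf16_to_utf8_lossy` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Append the UTF-8 encoding of one Unicode scalar value to `out`.
inline void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Transcode UTF-16 to UTF-8. Assumption of this sketch: every unpaired
// surrogate is replaced by one U+FFFD; well-formed input is unchanged.
inline std::string utf16_to_utf8_lossy(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (is_high_surrogate(u) && i + 1 < in.size() && is_low_surrogate(in[i + 1])) {
            // Well-formed surrogate pair: combine into one scalar value.
            const char32_t cp = 0x10000
                + ((static_cast<char32_t>(u) - 0xD800) << 10)
                + (static_cast<char32_t>(in[i + 1]) - 0xDC00);
            append_utf8(out, cp);
            ++i;
        } else if (is_high_surrogate(u) || is_low_surrogate(u)) {
            append_utf8(out, 0xFFFD);  // unpaired surrogate
        } else {
            append_utf8(out, u);
        }
    }
    return out;
}
```

Under the current wording, whether an unescaped path gets this treatment is implementation-defined; the proposed resolution is meant to require it for the `wchar_t` to UTF-8 case.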

I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).

It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as `\x{hex-digit-sequence}`. So an ill-formed sequence of two `wchar_t` values will be escaped as two `\x{...}` strings, which are then transcoded to UTF-8. If we transcode first (with substitutions) then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as `\x{fffd}`. SG16 should be asked to confirm that escaping first is intended, so that an escaped string shows the original invalid code units. For a non-escaped string, we want the ill-formed sequence to be formatted as �, which the proposed resolution tries to ensure.
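
The ordering question can be sketched as follows. This is only an illustration under stated assumptions: `char16_t` stands in for a 16-bit `wchar_t`, the helper names are hypothetical, the escaping step handles only ill-formed code units (the real rules in [format.string.escaped] also cover quotes, control characters and so on), and the final conversion to UTF-8 is omitted because it is the same for both orders. A single unpaired surrogate is used as the ill-formed input.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

inline bool is_high(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Copy `in`, handing every unpaired surrogate to `on_ill_formed`.
template <class OnIllFormed>
std::u16string rewrite_ill_formed(std::u16string_view in, OnIllFormed on_ill_formed) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (is_high(in[i]) && i + 1 < in.size() && is_low(in[i + 1])) {
            out += in[i];        // well-formed pair: copy both code units
            out += in[i + 1];
            ++i;
        } else if (is_high(in[i]) || is_low(in[i])) {
            on_ill_formed(out, in[i]);
        } else {
            out += in[i];
        }
    }
    return out;
}

// Escaping step: each ill-formed code unit becomes \x{...} with the
// shortest lower-case hexadecimal representation.
inline std::u16string escape_ill_formed(std::u16string_view in) {
    return rewrite_ill_formed(in, [](std::u16string& out, char16_t u) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "\\x{%x}", static_cast<unsigned>(u));
        for (const char* p = buf; *p != '\0'; ++p) out += static_cast<char16_t>(*p);
    });
}

// Substitution step: each ill-formed code unit becomes U+FFFD.
inline std::u16string substitute_ill_formed(std::u16string_view in) {
    return rewrite_ill_formed(in, [](std::u16string& out, char16_t) {
        out += u'\uFFFD';
    });
}
```

For the input `a`, `0xD800`, `b`, escaping first yields `a\x{d800}b` and the later substitution step has nothing left to do, so the original code unit stays visible. Substituting first yields `a�b`, and the original code unit is no longer recoverable by the time escaping runs.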

History
Date User Action Args
2024-04-19 15:20:33 admin set messages: + msg14064
2024-04-19 00:00:00 admin create