Title
Transcoding by std::formatter<std::filesystem::path>
Status
open
Section
[fs.path.fmtr.funcs]
Submitter
Jonathan Wakely

Created on 2024-04-19.00:00:00 last changed 2 weeks ago

Messages

Date: 2025-06-16.05:19:19

Proposed resolution:

This wording is relative to N5008.

  1. Modify [fs.path.fmtr.funcs] as indicated:

    
    template<class FormatContext>
      typename FormatContext::iterator
        format(const filesystem::path& p, FormatContext& ctx) const;
    

    -5- Effects: Let `s` be p.generic_string<filesystem::path::value_type>() if the `g` option is used, otherwise `p.native()`. Writes `s` into `ctx.out()`, adjusted according to the path-format-spec. If `charT` is `char`, `path::value_type` is `wchar_t`, and the ordinary literal encoding is UTF-8, then the escaped path (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with u+fffd replacement character per the Unicode Standard, Chapter 3.9 u+fffd Substitution in Conversion. If `charT` and `path::value_type` are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.

  2. Modify the entry in the index of implementation-defined behavior as indicated:
    transcoding of a formatted `path` when `charT` and `path::value_type` differ and the ordinary literal encoding is not UTF-8
Date: 2025-06-15.00:00:00

[ 2025-06-11; SG16 comments and improves wording ]

The "and not converting from `wchar_t` to UTF-8" wording added in the index of implementation-defined behavior by the current proposed resolution should be changed to "and the literal encoding is not UTF-8".

It was noted that "the literal encoding" is ambiguous in both the normative wording in [fs.path.fmtr.funcs] p5 and in the new wording quoted above. In both cases, the intent is to refer to the "ordinary literal encoding". However, some SG16 participants were reluctant to include a drive-by fix with the proposed resolution for this issue since the ambiguous literal encoding reference i s a pre-existing and separable issue. Those same SG16 participants were more concerned that the same wording was used in both [fs.path.fmtr.funcs] p5 and in the corresponding entry of the implementation-defined behavior index. I would defer to the LWG chair to decide whether to address this as an additional related clarification with this change or as a separate editorial or LWG issue.

The minimal change is to replace "and not converting from `wchar_t` to UTF-8" with "and the literal encoding is not UTF-8". The optional change is to insert "ordinary" before "literal encoding" as well. Once that is done, I'll have SG16 confirm they are content with the new proposed resolution.

Date: 2024-05-15.00:00:00

[ 2024-05-08; Reflector poll ]

Set priority to 2 after reflector poll.

This wording is relative to N4981.

  1. Modify [fs.path.fmtr.funcs] as indicated:

    
    template<class FormatContext>
      typename FormatContext::iterator
        format(const filesystem::path& p, FormatContext& ctx) const;
    
    -5- Effects: Let `s` be p.generic_string<filesystem::path::value_type>() if the `g` option is used, otherwise `p.native()`. Writes `s` into `ctx.out()`, adjusted according to the path-format-spec. If `charT` is `char`, `path::value_type` is `wchar_t`, and the literal encoding is UTF-8, then the escaped path (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with u+fffd replacement character per the Unicode Standard, Chapter 3.9 u+fffd Substitution in Conversion. If `charT` and `path::value_type` are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
  2. Modify the entry in the index of implementation-defined behavior as indicated:
    transcoding of a formatted `path` when `charT` and `path::value_type` differ and not converting from `wchar_t` to UTF-8
Date: 2024-04-19.00:00:00

[fs.path.fmtr.funcs] says:

If `charT` is `char`, `path::value_type` is `wchar_t`, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with u+fffd replacement character per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.

This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the `?` option is used. Otherwise, the form of transcoding is completely implementation-defined. However, this makes no sense. An escaped string will have no ill-formed subsequences, because they will already have been replaced as per [format.string.escaped]:

Otherwise (X is a sequence of ill-formed code units), each code unit U is appended to E in order as the sequence \x{hex-digit-sequence}, where hex-digit-sequence is the shortest hexadecimal representation of U using lower-case hexadecimal digits.

So only unescaped strings can have ill-formed sequences by the time we do transcoding to `char`, but whether or not any u+fffd substitution occurs is just implementation-defined.

I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).

It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as `\x{hex-digit-sequence}`. So an ill-formed sequence of two `wchar_t` values will be escaped as two `\x{...}` strings, which are then transcoded to UTF-8. If we transcode (with substitutions first) then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as `\x{fffd}`. SG16 should be asked to confirm that escaping first is intended, so that an escaped string shows the original invalid code units. For a non-escaped string, we want the ill-formed sequence to be formatted as �, which the proposed resolution tries to ensure.

History
Date User Action Args
2025-06-15 10:31:39adminsetmessages: + msg14831
2024-05-08 10:07:31adminsetmessages: + msg14117
2024-05-08 10:07:31adminsetstatus: new -> open
2024-04-19 15:20:33adminsetmessages: + msg14064
2024-04-19 00:00:00admincreate