Issue 4070: Transcoding by std::formatter<std::filesystem::path>

Title: Transcoding by std::formatter<std::filesystem::path>
Status: open
Section: [fs.path.fmtr.funcs]
Submitter: Jonathan Wakely

Created on 2024-04-19.00:00:00 last changed 1 month ago

Messages

msg14831

Date: 2025-06-16.05:19:19

Proposed resolution:

This wording is relative to N5008.

Modify [fs.path.fmtr.funcs] as indicated:
```
template<class FormatContext>
  typename FormatContext::iterator
    format(const filesystem::path& p, FormatContext& ctx) const;
```
-5- Effects: Let `s` be `p.generic_string<filesystem::path::value_type>()` if the `g` option is used, otherwise `p.native()`. Writes `s` into `ctx.out()`, adjusted according to the path-format-spec. If `charT` is `char`, `path::value_type` is `wchar_t`, and the ordinary literal encoding is UTF-8, then the ~~escaped path~~ (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If `charT` and `path::value_type` are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
Modify the entry in the index of implementation-defined behavior as indicated:
transcoding of a formatted `path` when `charT` and `path::value_type` differ and the ordinary literal encoding is not UTF-8
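
As a reading aid, the three cases in the Effects paragraph above can be sketched as a compile-time decision function. This is only an illustration of the case analysis: the `Transcoding` enum, the `transcoding_rule` function, and the boolean parameter standing in for "the ordinary literal encoding is UTF-8" are hypothetical names, not part of the proposed wording or of any library.

```cpp
#include <type_traits>

// Hypothetical names for illustration only; the standard defines no such
// enum or function.
enum class Transcoding { none, utf8_with_substitution, implementation_defined };

// CharT      : the formatter's character type (charT in the wording)
// PathValueT : stands in for filesystem::path::value_type
template <class CharT, class PathValueT>
constexpr Transcoding transcoding_rule(bool ordinary_literal_encoding_is_utf8) {
    // Case 1: char output, wchar_t native paths, UTF-8 ordinary literal encoding.
    if constexpr (std::is_same_v<CharT, char> && std::is_same_v<PathValueT, wchar_t>) {
        if (ordinary_literal_encoding_is_utf8)
            return Transcoding::utf8_with_substitution;
    }
    // Case 2: identical character types, so nothing to transcode.
    if constexpr (std::is_same_v<CharT, PathValueT>)
        return Transcoding::none;
    // Case 3: everything else is left to the implementation.
    else
        return Transcoding::implementation_defined;
}

static_assert(transcoding_rule<char, wchar_t>(true)  == Transcoding::utf8_with_substitution);
static_assert(transcoding_rule<char, wchar_t>(false) == Transcoding::implementation_defined);
static_assert(transcoding_rule<char, char>(true)     == Transcoding::none);
static_assert(transcoding_rule<wchar_t, char>(true)  == Transcoding::implementation_defined);
```

The second `static_assert` corresponds to the reworded index entry: the types differ and the ordinary literal encoding is not UTF-8.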

msg14117

Date: 2025-06-15.00:00:00

[ 2025-06-11; SG16 comments and improves wording ]

The "and not converting from `wchar_t` to UTF-8" wording added in the index of implementation-defined behavior by the current proposed resolution should be changed to "and the literal encoding is not UTF-8".

It was noted that "the literal encoding" is ambiguous in both the normative wording in [fs.path.fmtr.funcs] p5 and in the new wording quoted above. In both cases, the intent is to refer to the "ordinary literal encoding". However, some SG16 participants were reluctant to include a drive-by fix with the proposed resolution for this issue since the ambiguous literal encoding reference is a pre-existing and separable issue. Those same SG16 participants were more concerned that the same wording was used in both [fs.path.fmtr.funcs] p5 and in the corresponding entry of the implementation-defined behavior index. I would defer to the LWG chair to decide whether to address this as an additional related clarification with this change or as a separate editorial or LWG issue.

The minimal change is to replace "and not converting from `wchar_t` to UTF-8" with "and the literal encoding is not UTF-8". The optional change is to insert "ordinary" before "literal encoding" as well. Once that is done, I'll have SG16 confirm they are content with the new proposed resolution.

msg14064

Date: 2024-05-15.00:00:00

[ 2024-05-08; Reflector poll ]

Set priority to 2 after reflector poll.

This wording is relative to N4981.

Modify [fs.path.fmtr.funcs] as indicated:
```
template<class FormatContext>
  typename FormatContext::iterator
    format(const filesystem::path& p, FormatContext& ctx) const;
```
-5- Effects: Let `s` be `p.generic_string<filesystem::path::value_type>()` if the `g` option is used, otherwise `p.native()`. Writes `s` into `ctx.out()`, adjusted according to the path-format-spec. If `charT` is `char`, `path::value_type` is `wchar_t`, and the literal encoding is UTF-8, then the ~~escaped path~~ (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If `charT` and `path::value_type` are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
Modify the entry in the index of implementation-defined behavior as indicated:
transcoding of a formatted `path` when `charT` and `path::value_type` differ and not converting from `wchar_t` to UTF-8

msg14063

Date: 2024-04-19.00:00:00

[fs.path.fmtr.funcs] says:

If `charT` is `char`, `path::value_type` is `wchar_t`, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.

This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the `?` option is used. Otherwise, the form of transcoding is completely implementation-defined. However, this makes no sense. An escaped string will have no ill-formed subsequences, because they will already have been replaced as per [format.string.escaped]:

Otherwise (X is a sequence of ill-formed code units), each code unit U is appended to E in order as the sequence \x{hex-digit-sequence}, where hex-digit-sequence is the shortest hexadecimal representation of U using lower-case hexadecimal digits.

So only unescaped strings can have ill-formed sequences by the time we do transcoding to `char`, but whether or not any U+FFFD substitution occurs is just implementation-defined.

I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).

It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as `\x{hex-digit-sequence}`. So an ill-formed sequence of two `wchar_t` values will be escaped as two `\x{...}` strings, which are then transcoded to UTF-8. If we transcode (with substitutions) first then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as `\x{fffd}`. SG16 should be asked to confirm that escaping first is intended, so that an escaped string shows the original invalid code units. For a non-escaped string, we want the ill-formed sequence to be formatted as �, which the proposed resolution tries to ensure.
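
The difference between the two pipelines can be sketched in a few lines. The block below is a simplified model, not the library's implementation: it assumes UTF-16 input held in `char16_t` (a stand-in for `wchar_t` on platforms where native paths are wide), treats each unpaired surrogate as an ill-formed code unit, and escapes only those ill-formed units rather than applying the full [format.string.escaped] rules. The names `transcode`, `escape_ill_formed`, and `check_orderings` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

constexpr bool is_high(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Append the UTF-8 encoding of one Unicode scalar value.
void append_utf8(std::string& out, char32_t c) {
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
}

// Transcode UTF-16 to UTF-8, substituting U+FFFD for each unpaired surrogate.
std::string transcode(std::u16string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (is_high(u) && i + 1 < s.size() && is_low(s[i + 1])) {
            const char32_t c = 0x10000 + ((char32_t(u) - 0xD800) << 10)
                                       + (char32_t(s[i + 1]) - 0xDC00);
            append_utf8(out, c);
            ++i;  // consumed the low surrogate as well
        } else if (is_high(u) || is_low(u)) {
            append_utf8(out, 0xFFFD);  // ill-formed: substitute
        } else {
            append_utf8(out, u);
        }
    }
    return out;
}

// Simplified escaping: rewrite only unpaired surrogates as \x{hex}.
std::u16string escape_ill_formed(std::u16string_view s) {
    static constexpr char16_t digits[] = u"0123456789abcdef";
    std::u16string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (is_high(u) && i + 1 < s.size() && is_low(s[i + 1])) {
            out += u;
            out += s[++i];  // keep the well-formed pair intact
        } else if (is_high(u) || is_low(u)) {
            out += u"\\x{";
            for (int shift = 12; shift >= 0; shift -= 4)
                out += digits[(u >> shift) & 0xF];
            out += u'}';
        } else {
            out += u;
        }
    }
    return out;
}

void check_orderings() {
    const std::u16string in{u'a', char16_t(0xD800), u'b'};  // lone high surrogate
    // Escape first: the original code unit value survives in the output.
    assert(transcode(escape_ill_formed(in)) == "a\\x{d800}b");
    // Transcode without escaping: the value is replaced by U+FFFD (EF BF BD).
    assert(transcode(in) == "a\xEF\xBF\xBD" "b");
}
```

In this model, once escaping has run the transcoder never sees an ill-formed code unit, which is why the substitution rule only has an observable effect on unescaped output.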

History
Date	User	Action	Args
2025-06-15 10:31:39	admin	set	messages: + msg14831
2024-05-08 10:07:31	admin	set	messages: + msg14117
2024-05-08 10:07:31	admin	set	status: new -> open
2024-04-19 15:20:33	admin	set	messages: + msg14064
2024-04-19 00:00:00	admin	create