Created on 2024-04-19.00:00:00 last changed 6 months ago
Proposed resolution:
This wording is relative to N4981.
Modify [fs.path.fmtr.funcs] as indicated:
template<class FormatContext> typename FormatContext::iterator format(const filesystem::path& p, FormatContext& ctx) const;
-5- Effects: Let `s` bep.generic_string<filesystem::path::value_type>()
if the `g` option is used, otherwise `p.native()`. Writes `s` into `ctx.out()`, adjusted according to the path-format-spec. If `charT` is `char`, `path::value_type` is `wchar_t`, and the literal encoding is UTF-8, then theescaped path(possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with u+fffd replacement character per the Unicode Standard, Chapter 3.9 u+fffd Substitution in Conversion. If `charT` and `path::value_type` are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
transcoding of a formatted `path` when `charT` and `path::value_type` differ and not converting from `wchar_t` to UTF-8
[ 2024-05-08; Reflector poll ]
Set priority to 2 after reflector poll.
[fs.path.fmtr.funcs] says:
If `charT` is `char`, `path::value_type` is `wchar_t`, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with u+fffd replacement character per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.
This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the `?` option is used. Otherwise, the form of transcoding is completely implementation-defined. However, this makes no sense. An escaped string will have no ill-formed subsequences, because they will already have been replaced as per [format.string.escaped]:
Otherwise (X is a sequence of ill-formed code units), each code unit U is appended to E in order as the sequence\x{hex-digit-sequence}
, wherehex-digit-sequence
is the shortest hexadecimal representation of U using lower-case hexadecimal digits.
So only unescaped strings can have ill-formed sequences by the time we do transcoding to `char`, but whether or not any u+fffd substitution occurs is just implementation-defined.
I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).
It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as `\x{hex-digit-sequence}`. So an ill-formed sequence of two `wchar_t` values will be escaped as two `\x{...}` strings, which are then transcoded to UTF-8. If we transcode (with substitutions first) then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as `\x{fffd}`. SG16 should be asked to confirm that escaping first is intended, so that an escaped string shows the original invalid code units. For a non-escaped string, we want the ill-formed sequence to be formatted as �, which the proposed resolution tries to ensure.
History | |||
---|---|---|---|
Date | User | Action | Args |
2024-05-08 10:07:31 | admin | set | messages: + msg14117 |
2024-05-08 10:07:31 | admin | set | status: new -> open |
2024-04-19 15:20:33 | admin | set | messages: + msg14064 |
2024-04-19 00:00:00 | admin | create |