Title
Formatters converting sequences of char to sequences of wchar_t
Status
ready
Section
[format.formatter.spec]
Submitter
Mark de Wever

Created on 2023-06-01.00:00:00 last changed 1 month ago

Messages

Date: 2024-03-18.10:24:24

Proposed resolution:

This wording is relative to N4950.

  1. Modify [format.formatter.spec] as indicated:

    [Drafting note: The unwanted conversion happens due to the formatter base class specialization ([format.range.fmtdef])

    struct range-default-formatter<range_format::sequence, R, charT>
    

    which is defined in the header <format>. Therefore the disabling is only needed in this header. — end drafting note]

    -2- […]

    The parse member functions of these formatters interpret the format specification as a std-format-spec as described in [format.string.std].

    [Note 1: Specializations such as formatter<wchar_t, char> and formatter<const char*, wchar_t> that would require implicit multibyte / wide string or character conversion are disabled. — end note]

    -?- The header <format> provides the following disabled specializations:

    1. (?.1) — The string type specializations

      template<> struct formatter<char*, wchar_t>;
      template<> struct formatter<const char*, wchar_t>;
      template<size_t N> struct formatter<char[N], wchar_t>;
      template<class traits, class Allocator>
        struct formatter<basic_string<char, traits, Allocator>, wchar_t>;
      template<class traits>
        struct formatter<basic_string_view<char, traits>, wchar_t>;
      

    -3- For any types T and charT for which neither the library nor the user provides an explicit or partial specialization of the class template formatter, formatter<T, charT> is disabled.

Date: 2024-03-15.00:00:00

[ 2024-03-18; Tokyo: move to Ready ]

Date: 2023-07-15.00:00:00

[ 2023-07-26; Mark de Wever provides wording confirmed by SG16 ]

Date: 2023-06-15.00:00:00

[ 2023-06-08; Reflector poll ]

Set status to SG16 and priority to 3 after reflector poll.

Date: 2023-06-01.00:00:00

I noticed some interesting features introduced by the range-based formatters in C++23:

// Ill-formed in C++20 and C++23
const char* cstr = "hello";
char* str = const_cast<char*>(cstr);
std::format(L"{}", str);
std::format(L"{}", cstr);

// Ill-formed in C++20
// In C++23 they give L"['h', 'e', 'l', 'l', 'o']"
std::format(L"{}", "hello"); // A libc++ bug prevents this from working.
std::format(L"{}", std::string_view("hello"));
std::format(L"{}", std::string("hello"));
std::format(L"{}", std::vector{'h', 'e', 'l', 'l', 'o'});

An example is shown here. This only shows libc++ since libstdc++ and MSVC STL have not implemented the formatting ranges papers (P2286R8 and P2585R0) yet.

The difference between C++20 and C++23 is the existence of range formatters. These formatters use the formatter specialization formatter<char, wchar_t> which converts the sequence of chars to a sequence of wchar_ts.

In this conversion same_as<char, charT> is false, thus the requirements of the range-type s and ?s ([tab:formatter.range.type]) aren't met. So the following is ill-formed:

std::format(L"{:s}", std::string("hello")); // Not L"hello"

It is surprising that some string types can be formatted as a sequence of wide characters while others cannot. A sequence of characters can be a sequence of UTF-8 code units. This is explicitly supported in the width estimation of string types. The conversion of char to wchar_t converts the individual code units, which gives incorrect results for multi-byte code points; it does not transcode UTF-8 to UTF-16/32. The current behavior is not in line with the note in [format.formatter.spec]/2:

[Note 1: Specializations such as formatter<wchar_t, char> and formatter<const char*, wchar_t> that would require implicit multibyte / wide string or character conversion are disabled. — end note]
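
The code-unit problem described above can be illustrated without <format> at all. The following is only a sketch: widen_code_units and check_widening are hypothetical helpers, not library functions, and the cast through unsigned char is an assumption made so the values are predictable regardless of the signedness of char. It widens each char on its own, which is the kind of element-wise conversion the range formatters perform.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper (not part of the standard library): widens every char
// code unit individually. No UTF-8 to UTF-16/32 transcoding takes place.
std::wstring widen_code_units(std::string_view narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (char c : narrow) {
        // Assumption for illustration: go through unsigned char so the result
        // does not depend on whether char is signed on this platform.
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return wide;
}

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is the two code units 0xC3 0xA9
// in UTF-8. Element-wise widening keeps both code units as separate wide
// characters instead of producing the single code point U+00E9.
void check_widening() {
    const std::wstring wide = widen_code_units("\xC3\xA9");
    assert(wide.size() == 2);
    assert(wide[0] == L'\xC3');
    assert(wide[1] == L'\xA9');
    assert(wide != std::wstring(1, L'\xE9'));
}
```

A transcoding conversion would have produced one wide character here; the element-wise conversion produces two unrelated ones.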

Disabling this could be done by explicitly disabling the char to wchar_t sequence formatter. Something along the lines of

template<ranges::input_range R>
  requires(format_kind<R> == range_format::sequence &&
           same_as<remove_cvref_t<ranges::range_reference_t<R>>, char>)
struct formatter<R, wchar_t> : __disabled_formatter {};

where __disabled_formatter satisfies [format.formatter.spec]/5, would do the trick. This disables the conversion for all sequences, not only the string types. So vector, array, span, etc. would be disabled.

This does not disable the conversion in the range_formatter. This allows users to explicitly opt in to this formatter for their own specializations.

An alternative would be to only disable this conversion for string type specializations ([format.formatter.spec]/2.2) where char to wchar_t is used:

template<size_t N> struct formatter<charT[N], charT>;
template<class traits, class Allocator>
  struct formatter<basic_string<charT, traits, Allocator>, charT>;
template<class traits>
  struct formatter<basic_string_view<charT, traits>, charT>;

Disabling the following two is not strictly required:

template<> struct formatter<char*, wchar_t>;
template<> struct formatter<const char*, wchar_t>;

However, if (const) char* becomes an input_range in a future version of C++, these formatters would become enabled. Disabling all five instead of the three required specializations seems like a future-proof solution.

Since there is no enabled narrowing formatter specialization

template<> struct formatter<wchar_t, char>;

there are no issues for wchar_t to char conversions.

Before proceeding with a proposed resolution the following design questions need to be addressed:

  • Do we want to allow string types of chars to be formatted as sequences of wchar_ts?

  • Do we want to allow non string type sequences of chars to be formatted as sequences of wchar_ts?

  • Should we disable char to wchar_t conversion in the range_formatter?

SG16 has indicated they would like to discuss this issue during a telecon.

History
Date                 User   Action  Args
2024-03-18 10:24:24  admin  set     messages: + msg14019
2024-03-18 10:24:24  admin  set     status: open -> ready
2023-07-29 15:39:42  admin  set     messages: + msg13695
2023-07-29 15:39:42  admin  set     messages: + msg13694
2023-06-08 20:31:48  admin  set     messages: + msg13617
2023-06-08 20:31:48  admin  set     status: new -> open
2023-06-01 00:00:00  admin  create