Issue 3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale

Title: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale
Status: wp
Section: [locale.category][locale.codecvt.general]
Submitter: Victor Zverovich

Created on 2022-09-05.00:00:00 last changed 2024-04-02.10:29:12

Messages

Date: 2024-04-02.10:29:12

Proposed resolution:

This wording is relative to N4928.

Modify [locale.category], Table 105 ([tab:locale.category.facets]) — "Locale category facets" — and Table 106 ([tab:locale.spec]) "Required specializations" as indicated:

Table 105: Locale category facets [tab:locale.category.facets]
Category Includes facets

…

ctype ctype<char>, ctype<wchar_t> codecvt<char, char, mbstate_t> codecvt<char16_t, char8_t, mbstate_t> codecvt<char32_t, char8_t, mbstate_t> codecvt<wchar_t, char, mbstate_t>

…

[…]
Table 106: Required specializations [tab:locale.spec]
Category Includes facets

…

ctype ctype_byname<char>, ctype_byname<wchar_t> codecvt_byname<char, char, mbstate_t> codecvt_byname<char16_t, char8_t, mbstate_t> codecvt_byname<char32_t, char8_t, mbstate_t> codecvt_byname<wchar_t, char, mbstate_t>

…

Modify [locale.codecvt.general] as indicated:

[…]
-3- The specializations required in Table 105 ([locale.category]) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char8_t, mbstate_t> converts between the UTF-16 and UTF-8 encoding forms, and the specialization codecvt<char32_t, char8_t, mbstate_t> converts between the UTF-32 and UTF-8 encoding forms. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for ordinary and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementer. Other encodings can be converted by specializing on a program-defined stateT type. Objects of type stateT can contain any state that is useful to communicate to or from the specialized do_in or do_out members.
Modify [depr.locale.category] (Deprecated locale category facets) in Annex D as indicated:
-1- The ctype locale category includes the following facets as if they were specified in Table 105 [tab:locale.category.facets] of [locale.codecvt.general].
```
codecvt<char16_t, char, mbstate_t>
codecvt<char32_t, char, mbstate_t>
codecvt<char16_t, char8_t, mbstate_t>
codecvt<char32_t, char8_t, mbstate_t>
```
-2- The ctype locale category includes the following facets as if they were specified in Table 106 [tab:locale.spec] of [locale.codecvt.general].
```
codecvt_byname<char16_t, char, mbstate_t>
codecvt_byname<char32_t, char, mbstate_t>
codecvt_byname<char16_t, char8_t, mbstate_t>
codecvt_byname<char32_t, char8_t, mbstate_t>
```
-3- The following class template specializations are required in addition to those specified in [locale.codecvt]. The specializations codecvt<char16_t, char, mbstate_t> and codecvt<char16_t, char8_t, mbstate_t> convert between the UTF-16 and UTF-8 encoding forms, and the specializations codecvt<char32_t, char, mbstate_t> and codecvt<char32_t, char8_t, mbstate_t> convert between the UTF-32 and UTF-8 encoding forms.

msg13826

Date: 2024-04-02.10:29:12

[ Tokyo 2024-03-23; Status changed: Voting → WP. ]

msg13416

Date: 2023-11-07.22:39:32

[ Kona 2023-11-07; move to Ready ]

msg13344

Date: 2023-02-15.00:00:00

[ 2023-02-10; Victor Zverovich comments and provides improved wording ]

Per today's LWG discussion the following changes have been implemented in revised wording:

Deprecated the facets instead of removing them (also _byname variants which were previously missed).
Removed the changes to facet dtor since with deprecation it's no longer critical to provide other ways to access them.

msg13343

Date: 2023-02-19.12:51:56

[ Issaquah 2023-02-10; LWG issue processing ]

Removing these breaks most code using them today, because the most obvious way to use them is via use_facet on a locale, which would throw if they're removed (and because they were guaranteed to be present, code using them might not have bothered to check for them using has_facet). Instead of removing them, deprecate the guarantee that they're always present (so move them to [depr.locale.category]). Don't bother changing the destructor. Victor to update wording.
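The failure mode described above can be illustrated with a small sketch. It assumes nothing beyond std::has_facet and std::use_facet; the names find_facet, unregistered_facet and lookup_demo are hypothetical and exist only for this illustration. std::use_facet throws std::bad_cast when the facet is absent, so code that cannot rely on the guarantee has to check first:

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

// Hypothetical helper (not a standard library function): returns a pointer
// to the facet if `loc` contains it, or nullptr otherwise.
template <class Facet>
const Facet* find_facet(const std::locale& loc) {
    if (!std::has_facet<Facet>(loc)) {
        return nullptr;  // std::use_facet would throw std::bad_cast here
    }
    return &std::use_facet<Facet>(loc);
}

// Hypothetical program-defined facet that is never installed in any locale.
// It stands in for a facet whose presence is not guaranteed.
struct unregistered_facet : std::locale::facet {
    static std::locale::id id;
};
std::locale::id unregistered_facet::id;

// Usage sketch: a required facet is found, the uninstalled one is not.
inline void lookup_demo() {
    using wide_cvt = std::codecvt<wchar_t, char, std::mbstate_t>;
    assert(find_facet<wide_cvt>(std::locale::classic()) != nullptr);
    assert(find_facet<unregistered_facet>(std::locale::classic()) == nullptr);
}
```

Code written against the old guarantee typically calls use_facet without such a check, which is why outright removal would have turned working programs into ones that throw.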

This wording is relative to N4917.

Modify [locale.category], Table 105 ([tab:locale.category.facets]) — "Locale category facets" — as indicated:

Table 105: Locale category facets [tab:locale.category.facets]
Category Includes facets

…

ctype ctype<char>, ctype<wchar_t> codecvt<char, char, mbstate_t> codecvt<char16_t, char8_t, mbstate_t> codecvt<char32_t, char8_t, mbstate_t> codecvt<wchar_t, char, mbstate_t>

…

Modify [locale.codecvt.general] as indicated:
```
namespace std {
  […]
  template<class internT, class externT, class stateT>
    class codecvt : public locale::facet, public codecvt_base {
    public:
      using intern_type = internT;
      using extern_type = externT;
      using state_type = stateT;

      explicit codecvt(size_t refs = 0);
      ~codecvt();

      […]
    protected:
      ~codecvt();
      […]
    };
}
```
[…]
-3- The specializations required in Table ~~105 [tab:locale.category.facets]~~106 [tab:locale.spec] ([locale.category]) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char8_t, mbstate_t> converts between the UTF-16 and UTF-8 encoding forms, and the specialization codecvt<char32_t, char8_t, mbstate_t> converts between the UTF-32 and UTF-8 encoding forms. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for ordinary and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementer. Other encodings can be converted by specializing on a program-defined stateT type. Objects of type stateT can contain any state that is useful to communicate to or from the specialized do_in or do_out members.

msg12801

Date: 2022-09-15.00:00:00

[ 2022-09-28; SG16 responds ]

SG16 agrees that the codecvt facets mentioned in LWG3767 "codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale" are intended to be invariant with respect to locale. Unanimously in favor.

msg12744

Date: 2022-09-15.00:00:00

[ 2022-09-23; Reflector poll ]

Set priority to 3 after reflector poll. Send to SG16 (then maybe LEWG).

msg12743

Date: 2022-09-05.00:00:00

Table [tab:locale.category.facets] includes the following two facets:

codecvt<char16_t, char8_t, mbstate_t>
codecvt<char32_t, char8_t, mbstate_t>

However, neither of those actually has anything to do with a locale, and therefore it doesn't make sense to dynamically register them with std::locale. Instead they provide conversions between fixed encodings (UTF-8, UTF-16, UTF-32) that are unrelated to locale encodings, other than that they may happen to coincide with the encodings of some locales by accident.
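As a rough illustration of this point, the sketch below obtains the UTF-16/UTF-8 facet from a locale and encodes a string with it. The locale argument only serves as a lookup container; the conversion itself is between fixed encodings. The names utf8_char, utf8_string, utf16_utf8_cvt and to_utf8 are assumptions made for this example, and the fallback to char for compilers without char8_t is likewise an assumption, not part of the issue:

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

#if defined(__cpp_char8_t)
using utf8_char = char8_t;  // C++20: the specialization named in this issue
#else
using utf8_char = char;     // assumption: pre-C++20 fallback for older compilers
#endif

using utf8_string = std::basic_string<utf8_char>;
using utf16_utf8_cvt = std::codecvt<char16_t, utf8_char, std::mbstate_t>;

// Encodes UTF-16 input as UTF-8 using the facet looked up in `loc`.
utf8_string to_utf8(const std::locale& loc, const std::u16string& in) {
    if (in.empty()) {
        return utf8_string();
    }
    const auto& cvt = std::use_facet<utf16_utf8_cvt>(loc);
    std::mbstate_t state{};
    // Worst case is 3 UTF-8 bytes per UTF-16 code unit.
    utf8_string out(in.size() * 3, utf8_char());
    const char16_t* from_next = nullptr;
    utf8_char* to_next = nullptr;
    const auto result = cvt.out(state, in.data(), in.data() + in.size(), from_next,
                                &out[0], &out[0] + out.size(), to_next);
    if (result != std::codecvt_base::ok) {
        throw std::runtime_error("UTF-16 to UTF-8 conversion failed");
    }
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Calling to_utf8(std::locale::classic(), u"\u00e9") is expected to yield the two bytes of u8"\u00e9"; passing a different locale object should not change the bytes produced, since nothing about the conversion is locale-specific.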

The issue was introduced when adding codecvt<char[16|32]_t, char, mbstate_t> in N2035, which gave no design rationale for using codecvt in the first place. Likely it was trying to make a minimal set of changes and copied the wording for codecvt<wchar_t, char, mbstate_t>, but unfortunately didn't consider the encoding implications.

P0482 changed char to char8_t in these facets which made the issue more glaring but unfortunately, despite the breaking change, it failed to address it.

Apart from being an obvious design mistake, this also adds a small overhead to every locale construction, because the implementation has to copy these pseudo-facets for no good reason, violating the "don't pay for what you don't use" principle.

A simple fix is to remove the two facets from table [tab:locale.category.facets] and make them directly constructible.
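For context, codecvt currently has a protected destructor, so a facet object cannot be created directly outside a locale. A commonly used workaround is a derived class with a public destructor, sketched below. The names deletable_facet, utf8_char, utf8_string, utf16_utf8_cvt and from_utf8 are illustrative assumptions, as is the fallback to char for compilers without char8_t; this is not wording or code from the issue:

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <utility>

#if defined(__cpp_char8_t)
using utf8_char = char8_t;  // C++20: the specialization named in this issue
#else
using utf8_char = char;     // assumption: pre-C++20 fallback for older compilers
#endif

using utf8_string = std::basic_string<utf8_char>;
using utf16_utf8_cvt = std::codecvt<char16_t, utf8_char, std::mbstate_t>;

// Wrapper that gives a facet type a public destructor so that it can be
// constructed and destroyed without being installed in a std::locale.
template <class Facet>
struct deletable_facet : Facet {
    template <class... Args>
    deletable_facet(Args&&... args) : Facet(std::forward<Args>(args)...) {}
    ~deletable_facet() {}
};

// Decodes UTF-8 input into UTF-16 with a locally constructed facet.
std::u16string from_utf8(const utf8_string& in) {
    if (in.empty()) {
        return std::u16string();
    }
    deletable_facet<utf16_utf8_cvt> cvt;  // no std::locale involved
    std::mbstate_t state{};
    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    std::u16string out(in.size(), u'\0');
    const utf8_char* from_next = nullptr;
    char16_t* to_next = nullptr;
    const auto result = cvt.in(state, in.data(), in.data() + in.size(), from_next,
                               &out[0], &out[0] + out.size(), to_next);
    if (result != std::codecvt_base::ok) {
        throw std::runtime_error("UTF-8 to UTF-16 conversion failed");
    }
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Making the facets directly constructible, as suggested here, would remove the need for such a wrapper.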

History
Date	User	Action	Args
2024-04-02 10:29:12	admin	set	messages: + msg14030
2024-04-02 10:29:12	admin	set	status: voting -> wp
2024-03-18 09:32:04	admin	set	status: ready -> voting
2023-11-07 22:39:32	admin	set	messages: + msg13826
2023-11-07 22:39:32	admin	set	status: open -> ready
2023-02-19 12:51:56	admin	set	messages: + msg13416
2023-02-10 19:49:49	admin	set	messages: + msg13344
2023-02-10 19:49:49	admin	set	messages: + msg13343
2022-09-23 15:44:07	admin	set	messages: + msg12801
2022-09-23 15:44:07	admin	set	status: new -> open
2022-09-06 16:45:39	admin	set	messages: + msg12744
2022-09-05 00:00:00	admin	create