Issue 411: Use of universal-character-name in character versus string literals

Title: Use of universal-character-name in character versus string literals
Status: cd6
Section: 5.13.5 [lex.string]
Submitter: James Kanze

Created on 2003-04-23.00:00:00 last changed 47 months ago

Messages

msg938 (view)

Date: 2003-10-15.00:00:00

Notes from October 2003 meeting:

We decided we should forward this to the C committee and let them resolve it. Sent via e-mail to John Benito on November 14, 2003.

Reply from John Benito:

I talked this over with the C project editor, we believe this was handled by the C committee before publication of the current standard.

WG14 decided there needed to be a more restrictive rule for one-to-one mappings: rather than saying "a single c-char" as C++ does, the C standard says "a single character that maps to a single-byte execution character"; WG14 fully expect some (if not many or even most) UCNs to map to multiple characters.

Because of the fundamental differences between C and C++ character types, I am not sure the C committee is qualified to answer this satisfactorily for WG21. WG14 is willing to review any decision reached for compatibility.

I hope this helps.

(See also issue 912 for a related question.)

msg842 (view)

Date: 2020-11-15.00:00:00

[Accepted at the November, 2020 meeting as part of paper P2029R4.]

5.13.5 [lex.string] paragraph 5 reads

Escape sequences and universal-character-names in string literals have the same meaning as in character literals, except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding.

The first sentence refers us to 5.13.3 [lex.ccon], where we read in the first paragraph that "An ordinary character literal that contains a single c-char has type char [...]." Since the grammar shows that a universal-character-name is a c-char, something like '\u1234' must have type char (and thus be a single char element); in paragraph 5, we read that "A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implemenation-defined encoding."

This is in obvious contradiction with the second sentence. In addition, I'm not really clear what is supposed to happen in the case where the execution (narrow-)character set is UTF-8. Consider the character \u0153 (the oe in the French word oeuvre). Should '\u0153' be a char, with an "error" value, say '?' (in conformance with the requirement that it be a single char), or an int, with the two char values 0xC5, 0x93, in an implementation defined order (in conformance with the requirement that a character representable in the execution character set be represented). Supposing the former, should "\u0153" be the equivalent of "?" (in conformance with the first sentence), or "\xC5\x93" (in conformance with the second).

History
Date	User	Action	Args
2022-08-19 07:54:33	admin	set	status: wp -> cd6
2021-02-24 00:00:00	admin	set	status: accepted -> wp
2020-12-15 00:00:00	admin	set	status: open -> accepted
2003-11-15 00:00:00	admin	set	messages: + msg938
2003-04-23 00:00:00	admin	create