Issue 1332: Handling of invalid universal-character-names

Title: Handling of invalid universal-character-names
Status: cd5
Section: 5.3.1 [lex.charset]
Submitter: Mike Miller

Created on 2011-06-20.00:00:00; last changed 66 months ago

Messages

msg6524

Date: 2021-02-24.00:00:00

Additional note, February, 2021:

This issue was resolved editorially in N4842.

msg3511

Date: 2011-06-20.00:00:00

According to 5.3.1 [lex.charset] paragraph 2,

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800-0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00-0x1F or 0x7F-0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.

It is not specified what should happen if the hexadecimal value does not designate a Unicode code point: is that undefined behavior or does it make the program ill-formed?
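To make the gap concrete, the checks in the quoted paragraph can be sketched as a small classifier. This is only an illustration: the names `UcnStatus`, `is_control`, `is_basic_source_graphic`, and `classify` are hypothetical, the basic source character set is assumed to be the ASCII graphic characters other than `$`, `@`, and `` ` `` plus space, and the upper bound 0x10FFFF is the Unicode code point limit rather than anything stated in the quoted wording. The `NotACodePoint` case is the one the quoted text leaves unaddressed; the sketch merely reports it and does not decide how an implementation should treat it.

```cpp
#include <cstdint>

// Hypothetical result type; not taken from the standard.
enum class UcnStatus {
    Ok,
    Surrogate,         // 0xD800-0xDFFF: ill-formed wherever it appears
    ControlCharacter,  // 0x00-0x1F or 0x7F-0x9F: ill-formed outside literals
    BasicSourceChar,   // basic source character: ill-formed outside literals
    NotACodePoint      // above 0x10FFFF: the case this issue asks about
};

constexpr bool is_control(std::uint32_t v) {
    return v <= 0x1F || (v >= 0x7F && v <= 0x9F);
}

// Assumption: space and the ASCII graphic characters except '$', '@', '`'.
// The control characters that also belong to the basic source character set
// (tab, new-line, ...) are reported by is_control instead.
constexpr bool is_basic_source_graphic(std::uint32_t v) {
    return v >= 0x20 && v <= 0x7E && v != 0x24 && v != 0x40 && v != 0x60;
}

// in_literal: the name appears in a c-char-sequence or s-char-sequence.
constexpr UcnStatus classify(std::uint32_t v, bool in_literal) {
    return v > 0x10FFFF                    ? UcnStatus::NotACodePoint
         : (v >= 0xD800 && v <= 0xDFFF)    ? UcnStatus::Surrogate
         : in_literal                      ? UcnStatus::Ok
         : is_control(v)                   ? UcnStatus::ControlCharacter
         : is_basic_source_graphic(v)      ? UcnStatus::BasicSourceChar
                                           : UcnStatus::Ok;
}
```

Under these assumptions, `classify(0x110000, true)` yields `NotACodePoint`, while `classify(0x41, false)` yields `BasicSourceChar` and `classify(0x41, true)` yields `Ok`.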

As an aside, a note should be added explaining why these requirements apply to an r-char-sequence when, as the footnote at the end of the paragraph explains,

A sequence of characters resembling a universal-character-name in an r-char-sequence (5.13.5 [lex.string]) does not form a universal-character-name.
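The effect described in the footnote can be observed by comparing literal sizes. The following is a minimal sketch, assuming an implementation on which the character A occupies a single char in the narrow execution character set; the variable names are illustrative only.

```cpp
#include <cstddef>

// sizeof on a string literal counts the terminating null character.

// In an ordinary string literal, \u0041 is a universal-character-name
// designating 'A', so the literal holds 'A' plus the terminator.
constexpr std::size_t ordinary_size = sizeof("\u0041");

// In a raw string literal, the same six source characters are kept
// verbatim: '\\', 'u', '0', '0', '4', '1', plus the terminator.
constexpr std::size_t raw_size = sizeof(R"(\u0041)");
```

Under that assumption, `ordinary_size` is 2 and `raw_size` is 7, which shows that no universal-character-name is formed inside the r-char-sequence.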

History
Date	User	Action	Args
2021-02-24 00:00:00	admin	set	messages: + msg6524
2021-02-24 00:00:00	admin	set	status: drafting -> cd5
2011-06-20 00:00:00	admin	create