Title
Universal-character-names in comments
Status
cd6
Section
5.7 [lex.comment]
Submitter
David Krauss

Created on 2011-10-05.00:00:00, last changed 21 months ago

Messages

Date: 2022-02-15.00:00:00

Additional note (February, 2022):

P2314R4 Character sets and encodings (approved in October, 2021) effected changes so that extended characters are no longer translated to UCNs in phase 1.

Date: 2023-02-12.17:09:13

[ Resolved by P2314R4, adopted in October, 2021. ]

According to 5.3 [lex.charset] paragraph 2,

If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800-0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00-0x1F or 0x7F-0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.

These restrictions should not apply to comment text. Arguably the prohibitions of control characters and of characters in the basic source character set already do not apply, because they require that the preprocessing tokens for literals have already been recognized; this occurs in phase 3, which also replaces each comment with a single space. However, the prohibition of surrogate code points is not so limited and might conceivably be applied within comments.

Probably the most straightforward way of addressing this problem would be simply to state in 5.7 [lex.comment] that character sequences that resemble universal-character-names are not recognized as such within comment text.
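The following is a minimal sketch of the kind of source text at issue, not an excerpt from the standard or from P2314R4. The comments contain sequences that resemble universal-character-names for a surrogate code point, a basic character, and a control character; under the suggested approach they would be treated as plain comment text. The helper name basic_in_literal is hypothetical, and whether a particular compiler or language mode diagnoses any of these lines is an assumption that should be checked against that implementation.

```cpp
#include <cassert>
#include <string>

// Comment text only: \uD800 resembles a UCN for a surrogate code point,
// \u0041 resembles one for 'A' (a basic character), and \u0007 resembles
// one for a control character.
/* The same sequences in a block comment: \uDFFF \u0041 \u0007 */

// Hypothetical helper. Inside a string literal, the quoted wording does not
// prohibit a universal-character-name that designates a basic character
// (assuming C++11 or later).
std::string basic_in_literal() {
    return "\u0041";  // designates the same character as "A"
}

// One character plus the terminating null.
static_assert(sizeof("\u0041") == 2, "\\u0041 should encode as a single char");
```

Calling basic_in_literal() is expected to yield the one-character string "A", while the comments above it contribute nothing to the program beyond whitespace.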

History
Date                 User   Action  Args
2023-02-12 17:09:13  admin  set     status: review -> cd6
2022-02-18 07:47:23  admin  set     messages: + msg6705
2022-02-18 07:47:23  admin  set     status: open -> review
2011-10-05 00:00:00  admin  create