Issue 1335: Stringizing, extended characters, and universal-character-names

Title: Stringizing, extended characters, and universal-character-names
Status: cd6
Section: 15.7.3 [cpp.stringize]
Submitter: Johannes Schaub

Created on 2011-07-03.00:00:00, last changed 2022-11-20

Messages

Date: 2022-02-15.00:00:00

Additional note (February, 2022):

P2314R4 Character sets and encodings (approved in October, 2021) effected changes so that extended characters are no longer translated to UCNs in phase 1.

msg4471

Date: 2013-08-15.00:00:00

Additional note (August, 2013):

Implementations are granted substantial latitude in their handling of extended characters and universal-character-names in 5.2 [lex.phases] paragraph 1 phase 1, i.e.,

(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

However, this freedom is mostly nullified by the requirements of stringizing in 15.7.3 [cpp.stringize] paragraph 2:

If, in the replacement list, a parameter is immediately preceded by a # preprocessing token, both are replaced by a single character string literal preprocessing token that contains the spelling of the preprocessing token sequence for the corresponding argument.

This means that, in order to handle a construct like

  #define STRINGIZE_LITERAL( X ) # X
  #define STRINGIZE( X ) STRINGIZE_LITERAL( X )

  STRINGIZE( STRINGIZE( identifier_\u00fC\U000000Fc ) )

an implementation must recall the original spelling, including the form of UCN and the capitalization of any non-numeric hexadecimal digits, rather than simply translating the characters into a convenient internal representation.

To effect the freedom asserted in 5.2 [lex.phases], the description of stringizing should make the spelling of a universal-character-name implementation-defined.
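As an illustration of the mechanics only, the following sketch applies the same two macros to a plain identifier, where the result does not depend on how UCNs are spelled. The names identifier_abc, once, and twice are assumptions made for this example; the UCN case is deliberately left out because its spelling is exactly what this note is about.

```cpp
#include <string_view>

#define STRINGIZE_LITERAL( X ) # X
#define STRINGIZE( X ) STRINGIZE_LITERAL( X )

// One level: the argument's spelling becomes the contents of a string literal.
constexpr std::string_view once = STRINGIZE( identifier_abc );

// Two levels: the inner call is fully expanded first and yields the string
// literal "identifier_abc"; the outer call then stringizes that literal,
// inserting a \ before each of its " characters.
constexpr std::string_view twice = STRINGIZE( STRINGIZE( identifier_abc ) );

static_assert(once == "identifier_abc");
static_assert(twice == "\"identifier_abc\"");
```

Because the outer stringization reproduces the spelling of the inner result character for character, an implementation that had replaced a UCN with some internal representation would have to reconstruct the original form and hexadecimal digit case to produce the required string.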

msg3510

Date: 2021-10-15.00:00:00

[Resolved at the October, 2021 meeting by paper P2314R4.]

When a string literal containing an extended character is stringized (15.7.3 [cpp.stringize]), the result contains a universal-character-name instead of the original extended character. The reason is that the extended character is translated to a universal-character-name in translation phase 1 (5.2 [lex.phases]), so that the string literal "@" (where @ represents an extended character) becomes "\uXXXX". Because the preprocessing token is a string literal, when the stringizing occurs in translation phase 4, the \ is doubled, and the resulting string literal is "\"\\uXXXX\"". As a result, the universal-character-name is not recognized as such when the translation to the execution character set occurs in translation phase 5. (Note that phase 5 translation does occur if the stringized extended character does not appear in a string literal.) Existing practice appears to ignore these rules and preserve extended characters in stringized string literals, however.
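The doubling of the \ can be observed with a universal-character-name written directly in the source, as in this minimal sketch. It does not reproduce the phase 1 translation of an actual extended character, which is the part where existing practice is reported to differ. The names escaped_newline, escaped_ucn, and check_stringized_literals are hypothetical and chosen for this example.

```cpp
#include <cassert>
#include <string_view>

#define STRINGIZE_LITERAL( X ) # X

// Stringizing a string literal inserts a \ before each " and each \ in it,
// so the escape sequence \n is no longer an escape in the result.
constexpr std::string_view escaped_newline = STRINGIZE_LITERAL( "\n" );

// The same happens to the \ that introduces a universal-character-name:
// the result holds the characters \ u 0 0 f c rather than U+00FC.
constexpr std::string_view escaped_ucn = STRINGIZE_LITERAL( "\u00fc" );

// Hypothetical helper that checks the resulting strings at run time.
inline void check_stringized_literals() {
    assert(escaped_newline == "\"\\n\"");
    assert(escaped_newline.size() == 4);   // " \ n "
    assert(escaped_ucn == "\"\\u00fc\"");
    assert(escaped_ucn.size() == 8);       // " \ u 0 0 f c "
}
```

Calling check_stringized_literals() from main runs the checks. The stringized result contains a literal backslash followed by u00fc, which is why the universal-character-name is not recognized as such in a later phase.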

History
Date	User	Action	Args
2022-11-20 07:54:16	admin	set	status: open -> cd6
2022-02-18 07:47:23	admin	set	messages: + msg6703
2022-02-18 07:47:23	admin	set	status: drafting -> open
2013-09-03 00:00:00	admin	set	messages: + msg4471
2011-07-03 00:00:00	admin	create