Title
Treatment of universal-character-names outside of string-literals
Status
open
Section
5.2 [lex.phases]
Submitter
Jay Ghiron

Created on 2026-04-27.00:00:00; last changed 2 weeks ago

Messages

Date: 2026-05-03.18:49:34

(From submission #893.)

According to 5.2 [lex.phases] bullet 1.3, universal-character-names (outside of string-literals) are replaced in phase 3 with their corresponding (single) character:

... As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), except when matching a c-char-sequence, s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence, universal-character-names are recognized (5.3.2 [lex.universal.char]) and replaced by the designated element of the translation character set (5.3.1 [lex.charset]). ...

This rule (and the surrounding change in the treatment of UCNs) was introduced by paper P2314R4 (adopted in October 2021).
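As a minimal sketch of what the phase 3 replacement implies for ordinary tokens: an identifier spelled with a UCN and the same identifier spelled directly should name one entity. The example assumes a UTF-8 source file and a compiler that accepts non-ASCII characters in identifiers; the helper names read_direct and check_same_entity are hypothetical.

```cpp
#include <cassert>

// Assumption: UTF-8 source file; compiler accepts non-ASCII identifiers.
int \u03C0 = 3;              // declared using the UCN spelling

// Hypothetical helper: reads the variable through the direct spelling.
int read_direct() {
    return π;                // same identifier as \u03C0 after phase 3
}

// Hypothetical helper: both spellings designate the same object.
void check_same_entity() {
    assert(&\u03C0 == &π);
    assert(read_direct() == 3);
}
```

Calling check_same_entity() is expected to pass on implementations that treat the two spellings as the same identifier.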

Consider:

  #define X π
  #define X \u03C0   // clang and MSVC (old preprocessor) accept; gcc, EDG, and MSVC (new preprocessor) warn about incompatible macro redefinition

Also consider:

   #include <stdio.h>

   #define S1(...) # __VA_ARGS__
   #define S2(...) # __VA_OPT__(__VA_ARGS__)

   int main(){
     #define X \u03C0
     printf("%s %s\n", S1(X), S1(S1(X)));    // output on all implementations: X S1(X)
     printf("%s %s\n", S2(X), S2(S2(X)));    // output on all implementations: π "\u03C0"

     #define Y π
     printf("%s %s\n", S2(Y), S2(S2(Y)));    // output on all implementations: π "π"
   }

Note that 15.7.3 [cpp.stringize] paragraph 2 refers to the "original spelling", which might be interpreted as retaining UCNs:

... Otherwise, the original spelling of each preprocessing token in the stringizing argument is retained in the character string literal, except for special handling for producing the spelling of header-names, character-literals, and string-literals ...
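A single level of stringization cannot distinguish the two readings, which is presumably why the example above stringizes twice. If the spelling is retained, the result is the string-literal "\u03C0"; if the UCN has already been replaced, the result is "π". Both literals denote the same string. The following sketch illustrates this; STR and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

#define STR(x) #x            // hypothetical helper macro

// True if stringizing the UCN-spelled token yields the same string as the
// literal "\u03C0". This holds whether or not the UCN spelling is retained.
bool single_stringize_matches() {
    return std::strcmp(STR(\u03C0), "\u03C0") == 0;
}

void check_single_stringize() {
    assert(single_stringize_matches());
}
```

Only when the resulting string-literal is itself stringized does the retained backslash become observable, because stringizing a string-literal escapes its backslashes.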

Furthermore, there is the question of whether universal-character-names can be formed using ## concatenation:

  #define CAT(X,Y) X ## Y
  #define Y CAT(\,u03C0)
  int Y;                  // clang, gcc, EDG accept; MSVC (new preprocessor) rejects, because no valid preprocessing token is formed

Paper P2621R3 (adopted in June 2023) added the following note to 15.7.4 [cpp.concat] paragraph 3:

[Note 1: Concatenation can form a universal-character-name (5.3.1 [lex.charset]). —end note]

It is unclear what the normative basis for that note is, given that concatenation does not branch back to phase 3 where UCN recognition would happen. The implementation survey in P2621R3 indicated widespread implementation support for forming UCNs via concatenation.

History
Date User Action Args
2026-04-27 00:00:00 admin create