Title
Treatment of universal-character-names outside of string-literals
Status
open
Section
5.2 [lex.phases]
Submitter
Jay Ghiron

Created on 2026-04-27.00:00:00 last changed 2 days ago

Messages

Date: 2026-05-30.18:42:11

CWG 2026-05-29

CWG agreed that the specification of concatenation should be adjusted to recognize UCNs.

Partial possible resolution:

  1. Change in 15.7.4 [cpp.concat] paragraph 3 as follows:

    For both object-like and function-like macro invocations, before the replacement list is reexamined for more macro names to replace, each instance of a ## preprocessing token in the replacement list (not from an argument), together with its immediately preceding and immediately following preprocessing token, is [deleted: deleted and the preceding preprocessing token is concatenated with the following preprocessing token] replaced by a single preprocessing token formed by concatenating the spellings of the preceding and following preprocessing tokens and replacing any universal-character-names with the characters they designate (5.3.1 [lex.charset]). If no single valid preprocessing token can be formed (5.5 [lex.pptoken]), the program is ill-formed. Placemarker preprocessing tokens are handled specially: concatenation of two placemarkers results in a single placemarker preprocessing token, and concatenation of a placemarker with a non-placemarker preprocessing token results in the non-placemarker preprocessing token. [Note 1: Concatenation can form a universal-character-name (5.3.1 [lex.charset]). —end note]
  2. Add an example after 15.7.4 [cpp.concat] paragraph 4 as follows:

    #define CAT(X,Y)  X ## Y
    CAT(*,*)                    // error: ** is not a single preprocessing token
    
Date: 2026-05-03.18:49:34

(From submission #893.)

According to 5.2 [lex.phases] bullet 1.3, universal-character-names (outside of string-literals) are replaced in phase 3 with their corresponding (single) character:

... As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), except when matching a c-char-sequence, s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence, universal-character-names are recognized (5.3.2 [lex.universal.char]) and replaced by the designated element of the translation character set (5.3.1 [lex.charset]). ...

This rule (and the surrounding change in treatment of UCNs) was introduced by paper P2314R4 (adopted in October, 2021).

Consider:

  #define X π
  #define X \u03C0   // clang and MSVC (old preprocessor) accept; gcc, EDG, and MSVC (new preprocessor) warn about incompatible macro redefinition

Also consider:

   #include <stdio.h>

   #define S1(...) # __VA_ARGS__
   #define S2(...) # __VA_OPT__(__VA_ARGS__)

   int main(){
     #define X \u03C0
     printf("%s %s\n", S1(X), S1(S1(X)));    // output on all implementations: X S1(X)
     printf("%s %s\n", S2(X), S2(S2(X)));    // output on all implementations: π "\u03C0"

     #define Y π
     printf("%s %s\n", S2(Y), S2(S2(Y)));    // output on all implementations: π "π"
   }

Note that 15.7.3 [cpp.stringize] paragraph 2 talks about "original spelling", which might be interpreted as retaining UCNs:

... Otherwise, the original spelling of each preprocessing token in the stringizing argument is retained in the character string literal, except for special handling for producing the spelling of header-names, character-literals, and string-literals ...

Furthermore, there is the question of whether universal-character-names can be formed using ## concatenation:

  #define CAT(X,Y) X ## Y
  #define Y CAT(\,u03C0)
  int Y;                  // clang, gcc, EDG accept; MSVC (new preprocessor) rejects, because no valid preprocessing token is formed

Paper P2621R3 (adopted in June, 2023) added the following note to 15.7.4 [cpp.concat] paragraph 3:

[Note 1: Concatenation can form a universal-character-name (5.3.1 [lex.charset]). —end note]

It is unclear what the normative basis for that note is, given that concatenation does not branch back to phase 3 where UCN recognition would happen. The implementation survey in P2621R3 indicated widespread implementation support for forming UCNs via concatenation.

History
Date                 User   Action  Args
2026-05-30 18:42:11  admin  set     messages: + msg8588
2026-04-27 00:00:00  admin  create