Issue 3187: Treatment of universal-character-names outside of string-literals

Title: Treatment of universal-character-names outside of string-literals
Status: open
Section: 5.2 [lex.phases]
Submitter: Jay Ghiron

Created on 2026-04-27.00:00:00, last changed 1 month ago

Messages

msg8588

Date: 2026-05-30.18:42:11

CWG 2026-05-29

CWG agreed that the specification of concatenation should be adjusted to recognize UCNs.

Partial possible resolution:

Change in 15.7.4 [cpp.concat] paragraph 3 as follows:

For both object-like and function-like macro invocations, before the replacement list is reexamined for more macro names to replace, each instance of a ## preprocessing token in the replacement list (not from an argument), together with its immediately preceding and immediately following preprocessing token, is deleted and ~~the preceding preprocessing token is concatenated with the following preprocessing token~~ replaced by a single preprocessing token formed by concatenating the spellings of the preceding and following preprocessing tokens and replacing any universal-character-names with the characters they designate (5.3.1 [lex.charset]). If no single valid preprocessing token can be formed (5.5 [lex.pptoken]), the program is ill-formed. Placemarker preprocessing tokens are handled specially: concatenation of two placemarkers results in a single placemarker preprocessing token, and concatenation of a placemarker with a non-placemarker preprocessing token results in the non-placemarker preprocessing token. ~~[Note 1: Concatenation can form a universal-character-name (5.3.1 [lex.charset]). —end note]~~

Add an example after 15.7.4 [cpp.concat] paragraph 4 as follows:

#define CAT(X,Y)  X ## Y
CAT(*,*)                    // error: ** is not a single preprocessing token
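
For illustration only (this is not part of the proposed wording), the adjusted concatenation step can be modelled as: join the two spellings, then replace every universal-character-name in the result with the character it designates. The sketch below is a minimal model under stated assumptions: the names hex_value and concat_spellings are hypothetical, spellings are held as UTF-32 strings, only the \uXXXX and \UXXXXXXXX forms are handled, and the subsequent check that the result is a single valid preprocessing token is omitted.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: value of a hexadecimal digit, or -1 if c is not one.
int hex_value(char32_t c) {
    if (c >= U'0' && c <= U'9') return static_cast<int>(c - U'0');
    if (c >= U'a' && c <= U'f') return static_cast<int>(c - U'a') + 10;
    if (c >= U'A' && c <= U'F') return static_cast<int>(c - U'A') + 10;
    return -1;
}

// Hypothetical model of the proposed rule: concatenate the spellings of the
// two tokens, then replace each \uXXXX or \UXXXXXXXX by its code point.
// Assumption: no validity check of the resulting token is performed here.
std::u32string concat_spellings(std::u32string_view lhs, std::u32string_view rhs) {
    std::u32string joined(lhs);
    joined += rhs;

    std::u32string out;
    for (std::size_t i = 0; i < joined.size();) {
        // Number of hex digits expected after "\u" or "\U"; 0 means no UCN here.
        std::size_t digits = 0;
        if (joined[i] == U'\\' && i + 1 < joined.size())
            digits = joined[i + 1] == U'u' ? 4 : joined[i + 1] == U'U' ? 8 : 0;

        bool ok = digits != 0 && i + 2 + digits <= joined.size();
        char32_t cp = 0;
        for (std::size_t k = 0; ok && k < digits; ++k) {
            int v = hex_value(joined[i + 2 + k]);
            if (v < 0) ok = false;
            else cp = cp * 16 + static_cast<char32_t>(v);
        }

        if (ok) {            // a complete UCN: emit the designated character
            out += cp;
            i += 2 + digits;
        } else {             // anything else is copied unchanged
            out += joined[i];
            ++i;
        }
    }
    return out;
}
```

In this model, concat_spellings(U"\\", U"u03C0") yields the one-character spelling U+03C0, which corresponds to the CAT(\,u03C0) case discussed in msg8565.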

msg8565

Date: 2026-05-03.18:49:34

(From submission #893.)

According to 5.2 [lex.phases] bullet 1.3, universal-character-names (outside of string-literals) are replaced in phase 3 with their corresponding (single) character:

... As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), except when matching a c-char-sequence, s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence, universal-character-names are recognized (5.3.2 [lex.universal.char]) and replaced by the designated element of the translation character set (5.3.1 [lex.charset]). ...

This rule (and the surrounding change in treatment of UCNs) was introduced by paper P2314R4 (adopted in October, 2021).
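
As a rough illustration of the phase 3 rule quoted above, the following sketch replaces universal-character-names in a source fragment while leaving the contents of character and string literals untouched. It is a simplified model, not how any implementation is known to work: the function names are hypothetical, the output is assumed to be UTF-8, and raw string literals, header-names, comments, and the validity constraints on UCNs are all ignored.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Encode one code point as UTF-8 (assumption: cp is a valid scalar value).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// If a \uXXXX or \UXXXXXXXX starts at s[i], return its code point and
// advance i past it; otherwise leave i unchanged and return nullopt.
std::optional<char32_t> read_ucn(std::string_view s, std::size_t& i) {
    if (i + 1 >= s.size() || s[i] != '\\') return std::nullopt;
    std::size_t digits = s[i + 1] == 'u' ? 4 : s[i + 1] == 'U' ? 8 : 0;
    if (digits == 0 || i + 2 + digits > s.size()) return std::nullopt;
    char32_t cp = 0;
    for (std::size_t k = 0; k < digits; ++k) {
        char c = s[i + 2 + k];
        int v = c >= '0' && c <= '9' ? c - '0'
              : c >= 'a' && c <= 'f' ? c - 'a' + 10
              : c >= 'A' && c <= 'F' ? c - 'A' + 10 : -1;
        if (v < 0) return std::nullopt;
        cp = cp * 16 + static_cast<char32_t>(v);
    }
    i += 2 + digits;
    return cp;
}

// Hypothetical model of phase 3: UCNs are replaced everywhere except
// inside '...' and "..." (the c-char-sequence and s-char-sequence cases).
std::string replace_ucns_outside_literals(std::string_view src) {
    std::string out;
    char quote = 0;  // 0 outside a literal, else the expected closing quote
    for (std::size_t i = 0; i < src.size();) {
        char c = src[i];
        if (quote != 0) {
            out += c;
            ++i;
            if (c == '\\' && i < src.size()) out += src[i++];  // keep escapes verbatim
            else if (c == quote) quote = 0;
        } else if (c == '"' || c == '\'') {
            quote = c;
            out += c;
            ++i;
        } else if (auto cp = read_ucn(src, i)) {
            out += encode_utf8(*cp);
        } else {
            out += c;
            ++i;
        }
    }
    return out;
}
```

Under this model, the fragment `int \u03C0;` becomes `int π;`, whereas `"\u03C0"` is passed through unchanged, which is why the spelling of a UCN remains observable inside a string-literal but not in an identifier.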

Consider:

  #define X π
  #define X \u03C0   // clang and MSVC (old preprocessor) accept; gcc, EDG, and MSVC (new preprocessor) warn about incompatible macro redefinition

Also consider:

   #include <stdio.h>

   #define S1(...) # __VA_ARGS__
   #define S2(...) # __VA_OPT__(__VA_ARGS__)

   int main(){
     #define X \u03C0
     printf("%s %s\n", S1(X), S1(S1(X)));    // output on all implementations: X S1(X)
     printf("%s %s\n", S2(X), S2(S2(X)));    // output on all implementations: π "\u03C0"

     #define Y π
     printf("%s %s\n", S2(Y), S2(S2(Y)));    // output on all implementations: π "π"
   }

Note that 15.7.3 [cpp.stringize] paragraph 2 talks about "original spelling", which might be interpreted as retaining UCNs:

... Otherwise, the original spelling of each preprocessing token in the stringizing argument is retained in the character string literal, except for special handling for producing the spelling of header-names, character-literals, and string-literals ...

Furthermore, there is the question of whether universal-character-names can be formed using ## concatenation:

  #define CAT(X,Y) X ## Y
  #define Y CAT(\,u03C0)
  int Y;                  // clang, gcc, EDG accept; MSVC (new preprocessor) rejects, because no valid preprocessing token is formed

Paper P2621R3 (adopted in June, 2023) added the following note to 15.7.4 [cpp.concat] paragraph 3:

[Note 1: Concatenation can form a universal-character-name (5.3.1 [lex.charset]). —end note]

It is unclear what the normative basis for that note is, given that concatenation does not branch back to phase 3 where UCN recognition would happen. The implementation survey in P2621R3 indicated widespread implementation support for forming UCNs via concatenation.

History
Date	User	Action	Args
2026-05-30 18:42:11	admin	set	messages: + msg8588
2026-04-27 00:00:00	admin	create