Concatenation of string literals vs translation phases 5 and 6
5.2 [lex.phases]
Tom Honermann

Created on 2020-07-02.00:00:00 last changed 9 months ago


Date: 2020-07-02.00:00:00

According to 5.2 [lex.phases] paragraph 1, concatenation of adjacent string literals is performed in translation phase 6, after conversion of the literal values to the execution character set. However, 5.13.5 [lex.string] paragraph 11 indicates that the interpretation of the string contents is dependent on the encoding-prefixes specified for the literals being concatenated:

In translation phase 6 (5.2 [lex.phases]), adjacent string-literals are concatenated. If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix. If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. [Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. —end note]

This seems to indicate that string-literals with different encoding-prefixes are separately converted and then joined, potentially resulting in strings containing code unit sequences corresponding to different character encodings. This reading would contradict the intent, expressed in the adjacent table, that, e.g., u"a" "b" means the same as u"ab".
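The intended equivalence can be checked at compile time. The following is a minimal sketch, assuming a C++14 or later compiler; the helper same_units is a hypothetical name introduced here for illustration and is not part of the standard or of the issue text. It only exercises cases where at most one encoding-prefix is involved, so it does not probe the conditionally-supported mixed-prefix concatenations.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: compares two literals code unit by code unit,
// including the terminating null.
template <typename CharT, std::size_t N, std::size_t M>
constexpr bool same_units(const CharT (&a)[N], const CharT (&b)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i)
        if (a[i] != b[i]) return false;
    return true;
}

// The unprefixed literal is treated as having the prefix of its neighbour,
// so the concatenation has the same type as the single-token spelling...
static_assert(std::is_same<decltype(u"a" "b"), decltype(u"ab")>::value,
              "same type as u\"ab\"");

// ...and the same code units.
static_assert(same_units(u"a" "b", u"ab"), "char16_t, prefix on the left");
static_assert(same_units("a" U"b", U"ab"), "char32_t, prefix on the right");
static_assert(same_units(L"a" L"b", L"ab"), "wchar_t, same prefix on both");

// Ill-formed according to the quoted paragraph (UTF-8 next to wide):
// auto bad = u8"a" L"b";
```

Because all characters here are in the basic source character set, these assertions cannot distinguish "convert each literal separately, then join" from "join, then convert"; the two readings only differ for characters whose encoding depends on the encoding-prefix.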

There is implementation divergence in the handling of this specification.

Phases 5 and 6 cannot simply be reversed, because interpretation of escape sequences must precede concatenation, as specified later in the same paragraph:

Characters in concatenated strings are kept distinct. [Example:


"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB'). —end example]
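The quoted example can be restated as compile-time checks. This is a small sketch, assuming any C++11 or later compiler; the variable names joined and single are illustrative only.

```cpp
// "\xA" "B": the escape sequence is interpreted within its own token, so the
// result holds the two characters '\xA' and 'B' plus the terminating null.
constexpr char joined[] = "\xA" "B";
static_assert(sizeof(joined) == 3, "two characters plus terminator");
static_assert(joined[0] == '\xA', "first character is the escape");
static_assert(joined[1] == 'B', "second character is a plain B");

// Written as a single token, the hexadecimal escape consumes both digits.
constexpr char single[] = "\xAB";
static_assert(sizeof(single) == 2, "one character plus terminator");
static_assert(single[0] == '\xAB', "single hexadecimal character");
```

If concatenation were performed on the raw token text before escape sequences were interpreted, joined would instead have the same contents as single, which is why the two phases cannot simply be swapped.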

Richard Smith suggested that "we should remove phases 5 and 6 entirely, parse one or more string-literal tokens as a string literal expression, and only perform the translation from the contents of the string literal tokens into characters in the execution character set as part of specifying the semantics of a string literal expression."

Date                 User   Action  Args
2020-07-02 00:00:00  admin  create