Lexical issues with raw strings
5.13.5 [lex.string]
Joseph Myers

Created on 2009-04-16; last changed 143 months ago


Date: 2010-03-15.00:00:00

[Voted into WP at March, 2010 meeting as document N3077.]

Date: 2009-10-15.00:00:00

Proposed resolution (October, 2009):

  1. Change the grammar in 5.13.5 [lex.string] as follows:

    • d-char:
        any member of the basic source character set except:
          space, the left square bracket [, the right square bracket ], the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.
  2. Change 5.13.5 [lex.string] paragraph 2 as follows:

    A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters. If the input stream contains a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", those characters are considered to begin a raw string literal even if that literal is not well-formed. [Example:

      #define R "x"
      const char* s = R"y"; // ill-formed raw string, not "x" "y"

    end example]

Date: 2009-11-08.00:00:00

Additional note, June, 2009:

The translation of characters that are not in the basic source character set into universal-character-names in translation phase 1 raises an additional problem: each such character will occupy at least six of the 16 d-chars that are permitted. Thus, for example, R"@@@[]@@@" is ill-formed because @@@ becomes \u0040\u0040\u0040, which is 18 characters.

One possibility for addressing this might be to disallow the \ character completely as a d-char, which would have the effect of restricting d-chars to the basic source character set.

Date: 2009-08-03.00:00:00

The specification of raw string literals interacts poorly with the specification of preprocessing tokens. The grammar in 5.4 [lex.pptoken] has a production reading

    each non-white-space character that cannot be one of the above

This is echoed in the max-munch rule in paragraph 3:

If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.

This raises questions about the handling of raw string literals. Consider, for instance,

    #define R "x"
    const char* s = R"y";

The character sequence R"y" does not satisfy the syntactic requirements for a raw string. Should it be diagnosed as an ill-formed attempt at a raw string, or should it be well-formed, interpreting R as a preprocessing token that is a macro name and thus initializing s with a pointer to the string "xy"?

For another example, consider:

    #define R "]"
    const char* x = R"foo[";

Presumably this means that the entire rest of the file must be scanned for the characters ]foo" and, if they are not found, macro-expand R and initialize x with a pointer to the string "]foo[". Is this the intended result?

Finally, does the requirement in 5.13.5 [lex.string] that

A d-char-sequence shall consist of at most 16 characters.

mean that

    #define R "x"
    const char* y = R"12345678901234567[y]12345678901234567";

is ill-formed, or a valid initialization of y with a pointer to the string "x12345678901234567[y]12345678901234567"?

Date                 User   Action  Args
2010-03-29 00:00:00  admin  set     messages: + msg2652
2010-03-29 00:00:00  admin  set     status: ready -> cd2
2009-11-08 00:00:00  admin  set     messages: + msg2334
2009-11-08 00:00:00  admin  set     status: open -> ready
2009-06-19 00:00:00  admin  set     messages: + msg2049
2009-04-16 00:00:00  admin  create