Issue 872: Lexical issues with raw strings

Title: Lexical issues with raw strings
Status: cd2
Section: 5.13.5 [lex.string]
Submitter: Joseph Myers

Created on 2009-04-16.00:00:00 last changed 186 months ago

Messages

msg2652

Date: 2010-03-15.00:00:00

[Voted into WP at March, 2010 meeting as document N3077.]

msg2334

Date: 2009-10-15.00:00:00

Proposed resolution (October, 2009):

Change the grammar in 5.13.5 [lex.string] as follows:

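d-char:

any member of the basic source character set except: space, the left square bracket [, the right square bracket ], the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.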

Change 5.13.5 [lex.string] paragraph 2 as follows:

A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters. If the input stream contains a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", those characters are considered to begin a raw string literal even if that literal is not well-formed. [Example:
  #define R "x"
  const char* s = R"y"; // ill-formed raw string, not "x" "y"
—end example]
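
The added sentence can be read as a purely local check made at the start of a token: if the upcoming characters form a raw-string prefix followed by a double quote, the lexer commits to a raw string literal without first checking that the rest is well-formed. The following is a minimal sketch of that check, not a description of any particular compiler. The function name begins_raw_string is hypothetical, and the prefix list (R, u8R, uR, UR, LR) is an assumption taken from the draft grammar of the time rather than from the text quoted here.

```cpp
#include <array>
#include <string_view>

// Hypothetical helper: `rest` is the unread input, positioned at the start
// of a new preprocessing token. Returns true if the input begins with a
// raw-string prefix followed by a double quote. Under the proposed wording
// the lexer then commits to a raw string literal, well-formed or not.
bool begins_raw_string(std::string_view rest) {
    // Assumed prefix set: optional encoding prefix, then R, then the quote.
    constexpr std::array<std::string_view, 5> prefixes = {
        "u8R\"", "uR\"", "UR\"", "LR\"", "R\""};
    for (std::string_view p : prefixes) {
        // substr clamps the length, so short inputs are handled safely.
        if (rest.substr(0, p.size()) == p) return true;
    }
    return false;
}
```

With this check, the input R"y"; from the example is committed to the raw-string path, so the macro R is never expanded there, while R "y"; (with a space after R) is not.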

msg2049

Date: 2009-11-08.00:00:00

Additional note, June, 2009:

The translation of characters that are not in the basic source character set into universal-character-names in translation phase 1 raises an additional problem: each such character will occupy at least six of the 16 d-chars that are permitted. Thus, for example, R"@@@[]@@@" is ill-formed because @@@ becomes \u0040\u0040\u0040, which is 18 characters.
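
The arithmetic behind the 18-character figure can be checked with a short sketch. It assumes that phase 1 writes each character outside the basic source character set as \uXXXX (six characters) when the code point fits in 16 bits and as \UXXXXXXXX (ten characters) otherwise. The names in_basic_source_set and phase1_length are hypothetical, introduced only for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if c is one of the 96 members of the basic
// source character set (letters, digits, 29 punctuators, 5 white-space).
bool in_basic_source_set(char32_t c) {
    constexpr std::u32string_view basic =
        U"abcdefghijklmnopqrstuvwxyz"
        U"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        U"0123456789"
        U"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
        U" \t\v\f\n";
    return basic.find(c) != std::u32string_view::npos;
}

// Number of characters `text` occupies after phase 1, under the assumption
// stated above about how universal-character-names are spelled.
std::size_t phase1_length(std::u32string_view text) {
    std::size_t n = 0;
    for (char32_t c : text) {
        if (in_basic_source_set(c)) {
            n += 1;   // kept as-is
        } else if (c <= 0xFFFF) {
            n += 6;   // backslash, u, four hex digits
        } else {
            n += 10;  // backslash, U, eight hex digits
        }
    }
    return n;
}
```

Under these assumptions phase1_length(U"@@@") is 18, which exceeds the limit of 16, matching the figure given in the note.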

One possibility for addressing this might be to disallow the \ character completely as a d-char, which would have the effect of restricting d-chars to the basic source character set.

msg2048

Date: 2009-08-03.00:00:00

The specification of raw string literals interacts poorly with the specification of preprocessing tokens. The grammar in 5.5 [lex.pptoken] has a production reading

each non-white-space character that cannot be one of the above

This is echoed in the max-munch rule in paragraph 3:

If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.

This raises questions about the handling of raw string literals. Consider, for instance,

    #define R "x"
    const char* s = R"y";

The character sequence R"y" does not satisfy the syntactic requirements for a raw string. Should it be diagnosed as an ill-formed attempt at a raw string, or should it be well-formed, interpreting R as a preprocessing token that is a macro name and thus initializing s with a pointer to the string "xy"?

For another example, consider:

    #define R "]"
    const char* x = R"foo[";

Presumably this means that the entire rest of the file must be scanned for the characters ]foo" and, if they are not found, macro-expand R and initialize x with a pointer to the string "]foo[". Is this the intended result?

Finally, does the requirement in 5.13.5 [lex.string] that

A d-char-sequence shall consist of at most 16 characters.

mean that

    #define R "x"
    const char* y = R"12345678901234567[y]12345678901234567";

is ill-formed, or a valid initialization of y with a pointer to the string "x12345678901234567[y]12345678901234567"?
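
The three examples in this message fail the raw-string syntax in three different ways, which a small checker makes concrete. The sketch below assumes the bracket-delimited form used in this issue (R"delim[ ... ]delim") and a lexer that has already consumed the opening R". The names RawCheck, is_d_char and check_raw_body are hypothetical. Two simplifications are mine rather than the draft's: the double quote is rejected inside a delimiter, and membership in the basic source character set is not checked.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

enum class RawCheck { WellFormed, BadDelimiter, DelimiterTooLong, Unterminated };

// Hypothetical helper: may c appear in a delimiter? Excludes space, the
// brackets, backslash and the listed control characters. Rejecting '"' is
// a simplification of this sketch.
bool is_d_char(char c) {
    switch (c) {
        case ' ': case '[': case ']': case '\\': case '"':
        case '\t': case '\v': case '\f': case '\n':
            return false;
        default:
            return true;
    }
}

// `text` starts immediately after the opening R" of a candidate literal.
RawCheck check_raw_body(std::string_view text) {
    constexpr std::size_t max_delim = 16;
    std::size_t n = 0;  // length of the candidate d-char-sequence
    while (n < text.size() && is_d_char(text[n])) ++n;
    // The delimiter must be followed by the opening bracket.
    if (n == text.size() || text[n] != '[') return RawCheck::BadDelimiter;
    if (n > max_delim) return RawCheck::DelimiterTooLong;
    // The literal ends at the first occurrence of ]delim" after the bracket.
    const std::string terminator = "]" + std::string(text.substr(0, n)) + "\"";
    if (text.find(std::string_view(terminator), n + 1) == std::string_view::npos)
        return RawCheck::Unterminated;
    return RawCheck::WellFormed;
}
```

Applied to the text after R" in each example, the sketch reports BadDelimiter for y";, Unterminated for foo["; when no ]foo" follows in the input, and DelimiterTooLong for the 17-character delimiter. Whether each outcome should be a diagnostic or a fallback to macro expansion of R is exactly the question this message raises.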

History
Date	User	Action	Args
2010-03-29 00:00:00	admin	set	messages: + msg2652
2010-03-29 00:00:00	admin	set	status: ready -> cd2
2009-11-08 00:00:00	admin	set	messages: + msg2334
2009-11-08 00:00:00	admin	set	status: open -> ready
2009-06-19 00:00:00	admin	set	messages: + msg2049
2009-04-16 00:00:00	admin	create