Title
char16_t string literals and surrogate pairs
Status
cd4
Section
5.13.5 [lex.string]
Submitter
Jeffrey Yasskin

Created on 2013-10-30.00:00:00 last changed 49 months ago

Messages

Date: 2014-11-15.00:00:00

[Moved to DR at the November, 2014 meeting.]

Date: 2014-02-15.00:00:00

Proposed resolution (February, 2014):

Change 5.13.5 [lex.string] paragraph 16 as follows:

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (5.13.3 [lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a char16_t string literal may yield a surrogate pair. In a narrow string literal...
Date: 2013-10-30.00:00:00

The intent of char16_t string literals, as evident from 5.13.5 [lex.string] paragraph 9, is that they be encoded in UTF-16, that is, including surrogate pairs to represent code points outside the basic multi-lingual plane:

A single c-char may produce more than one char16_t character in the form of surrogate pairs.

Paragraph 15, however, is inconsistent with this approach, saying,

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (5.13.3 [lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \.

The reason is that code points outside the basic multi-lingual plane are ill-formed in char16_t character literals:

A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed.

It should be clarified that this restriction does not apply to char16_t string literals.

History
Date User Action Args
2017-02-06 00:00:00adminsetstatus: drwp -> cd4
2015-05-25 00:00:00adminsetstatus: dr -> drwp
2015-04-13 00:00:00adminsetmessages: + msg5324
2014-11-24 00:00:00adminsetstatus: ready -> dr
2014-03-03 00:00:00adminsetmessages: + msg4798
2014-03-03 00:00:00adminsetstatus: open -> ready
2013-10-30 00:00:00admincreate