Created on 2009-03-03 00:00:00, last changed 150 months ago
[Voted into WP at March, 2010 meeting.]
Proposed resolution (February, 2010):
Change 5.2 [lex.phases] paragraph 1 phase 2 as follows:
If a source file that is not empty does not end in a new-line character, or ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, the behavior is undefined.
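The two end-of-file conditions named in this wording can be told apart mechanically. The following is a minimal sketch only, assuming the file contents are already in memory and that lines end in a bare '\n'; the names FileEnding and classifyEnding are hypothetical and not part of any proposed text.

```cpp
#include <string_view>

// Hypothetical classification of how a source file's text ends.
enum class FileEnding { Empty, EndsWithNewline, MissingNewline, SplicedNewline };

// Classify the end of a source file as seen before phase 2 splicing.
// Assumption: new-lines are a single '\n' (no "\r\n" handling).
FileEnding classifyEnding(std::string_view text) {
    if (text.empty()) return FileEnding::Empty;  // empty files are exempt
    if (text.back() != '\n') return FileEnding::MissingNewline;
    // A backslash right before the final new-line would splice it away.
    if (text.size() >= 2 && text[text.size() - 2] == '\\')
        return FileEnding::SplicedNewline;
    return FileEnding::EndsWithNewline;
}
```

Under this sketch, only MissingNewline and SplicedNewline correspond to the cases the paragraph addresses; an empty file and a file ending in an ordinary new-line are unaffected.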
Change 5.8 [lex.header] paragraph 2 as follows:
If either of the characters ' or \, or either of the character sequences /* or // appears in a q-char-sequence or a h-char-sequence, or the character " appears in a h-char-sequence, the behavior is undefined. [Footnote: Thus, sequences of characters that resemble escape sequences cause undefined behavior. —end footnote]
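As an illustration of which header-names this paragraph covers, the check below scans the text between the delimiters for the listed characters and sequences. It is a sketch under stated assumptions: HeaderKind and hasProblematicSequence are hypothetical names, and the input is assumed to be the q-char-sequence or h-char-sequence with its delimiters already removed.

```cpp
#include <string_view>

// Which delimiter form the header-name used: "..." or <...>.
enum class HeaderKind { Quoted, Angled };

// True if the sequence contains a character or character sequence
// singled out by [lex.header] paragraph 2.
bool hasProblematicSequence(std::string_view name, HeaderKind kind) {
    constexpr auto npos = std::string_view::npos;
    if (name.find('\'') != npos || name.find('\\') != npos) return true;
    if (name.find("/*") != npos || name.find("//") != npos) return true;
    // A double quote can only occur inside the <...> form; in the "..."
    // form it would already have terminated the header-name.
    if (kind == HeaderKind::Angled && name.find('"') != npos) return true;
    return false;
}
```

For example, a Windows-style path such as dir\file.h is flagged because of the backslash, while sys/types.h is not, since a single / is not among the listed sequences.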
Notes from the October, 2009 meeting:
The CWG decided that the non-UCN aspects of this issue should be resolved, while the overall questions regarding trigraphs, UCNs, and raw strings will be investigated separately.
Additional note, March, 2009:
The undefined behavior referred to above regarding universal-character-names is the result of the considerations described in the C99 Rationale, section 5.2.1, in the part entitled “UCN models.” Three different models for support of UCNs are described, each involving different conversions between UCNs and wide characters and/or at different times during program translation. Implementations, as well as the specification in a language standard, can employ any of the three, but it must be impossible for a well-defined program to determine which model was actually employed by the implementation. The implication of this “equivalence principle” is that any construct that would give different results under the different models must be classified as undefined behavior. For example, an apparent UCN resulting from a line-splice would be recognized as a UCN by an implementation in which all wide characters were translated immediately into UCNs, as described in C++ phase 1, but would not be recognized as a UCN by another implementation in which all UCNs were translated immediately into wide characters (a possibility mentioned parenthetically in C++ phase 1).
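The line-splice example can be made concrete with a small simulation. This is a sketch, not a description of any real implementation: spliceLines and containsUcn are hypothetical helpers, phase 2 is reduced to deleting each backslash that is immediately followed by a new-line, and the UCN test is purely textual (it ignores context such as an escaped backslash).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Phase 2 sketch: delete each backslash immediately followed by a new-line.
std::string spliceLines(std::string_view src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the new-line as well
            continue;
        }
        out += src[i];
    }
    return out;
}

// True if text contains \uXXXX or \UXXXXXXXX, where each X is a hex digit.
bool containsUcn(std::string_view text) {
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        if (text[i] != '\\') continue;
        const std::size_t digits =
            text[i + 1] == 'u' ? 4 : text[i + 1] == 'U' ? 8 : 0;
        if (digits == 0 || i + 2 + digits > text.size()) continue;
        bool allHex = true;
        for (std::size_t k = 0; k < digits; ++k) {
            if (!std::isxdigit(static_cast<unsigned char>(text[i + 2 + k])))
                allHex = false;
        }
        if (allHex) return true;
    }
    return false;
}
```

Given the eight characters \u00, backslash, new-line, E9, containsUcn reports no UCN before splicing and one afterwards. An implementation that converts UCNs to wide characters in phase 1 looks at the first form and finds nothing to convert, whereas one that recognizes UCNs after phase 2 sees the second form; this is the divergence the equivalence principle rules out for well-defined programs.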
There are additional implications for this “equivalence principle” beyond the ones identified in the UK CD comments. See also issue 578; presumably a string like the one in that issue should also be described as having undefined behavior. Also, because C++'s model introduces backslash characters as part of UCNs for any character outside the basic source character set, any header-name that contains such a character (e.g., #include "@.h") will have undefined behavior in C++. This is also the reason that UCNs are translated into wide characters inside raw strings: two of the three models articulated in the C99 Rationale translate to or from UCNs in phase 1, before raw strings are recognized as tokens in phase 3, so raw strings cannot treat UCNs differently from the way they are treated in other contexts. See also issue 789 for similar points regarding trigraphs.
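The header-name consequence can be sketched the same way. The code below assumes the phase 1 model in which every character outside the basic source character set is replaced by a UCN; inBasicSourceSet and toUcnForm are hypothetical names, and the conversion handles single-byte (ASCII-range) input only, so it is an illustration rather than a real phase 1.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// The 96 characters of the basic source character set.
bool inBasicSourceSet(char c) {
    constexpr std::string_view basic =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
        " \t\v\f\n";
    return basic.find(c) != std::string_view::npos;
}

// Phase 1 sketch: replace every other character by \uXXXX.
// Assumption: one byte is one character (no multi-byte encodings).
std::string toUcnForm(std::string_view text) {
    std::string out;
    for (char c : text) {
        if (inBasicSourceSet(c)) {
            out += c;
            continue;
        }
        char buf[7];  // backslash, 'u', four hex digits, terminator
        std::snprintf(buf, sizeof buf, "\\u%04X",
                      static_cast<unsigned>(static_cast<unsigned char>(c)));
        out += buf;
    }
    return out;
}
```

Applied to the q-char-sequence @.h, this produces \u0040.h, which now contains a backslash and therefore falls under the 5.8 [lex.header] rule even though the programmer wrote none.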
There are several instances of undefined behavior in lexical processing:
5.2 [lex.phases] paragraph 1, phase 2: a universal-character-name resulting from a line splice.
5.2 [lex.phases] paragraph 1, phase 2: a file ending without a new-line character or with a new-line character that is spliced away.
5.2 [lex.phases] paragraph 1, phase 4: a universal-character-name resulting from macro token concatenation.
5.8 [lex.header] paragraph 2: ', \, /*, //, or " appearing in a header-name.
These would be more appropriately handled as conditionally-supported behavior, requiring implementations either to document their handling of these constructs or to issue a diagnostic.
|2010-03-29 00:00:00||admin||set||messages: + msg2648|
|2010-03-29 00:00:00||admin||set||status: review -> cd2|
|2010-02-16 00:00:00||admin||set||messages: + msg2519|
|2010-02-16 00:00:00||admin||set||status: drafting -> review|
|2009-11-08 00:00:00||admin||set||messages: + msg2382|
|2009-03-23 00:00:00||admin||set||messages: + msg1936|