Issue 2639: new-lines after phase 1

Title: new-lines after phase 1
Status: c++23
Section: 5.2 [lex.phases]
Submitter: US

Created on 2022-11-03.00:00:00, last changed 36 months ago

Messages

Date: 2022-11-20.07:54:16

Proposed resolution (approved by CWG 2022-11-08):

Change in 5.2 [lex.phases] paragraph 1.1 as follows:

... If an input file is determined to be a UTF-8 file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set. In the resulting sequence, each pair of characters in the input sequence consisting of U+000D CARRIAGE RETURN followed by U+000A LINE FEED, as well as each U+000D CARRIAGE RETURN not immediately followed by a U+000A LINE FEED, is replaced by a single new-line character.

For any other kind of input file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements (5.3.1 [lex.charset]) ~~(introducing new-line characters for~~, representing end-of-line indicators as new-line characters~~)~~.
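
As an illustration only, the replacement rule for UTF-8 input in the wording above can be sketched as follows. This is not part of the resolution or of any implementation: the function name normalize_new_lines is hypothetical, the input is assumed to be already decoded into UCS scalar values, and U+000A LINE FEED is assumed to stand for the new-line character.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from the standard or any compiler):
// replaces each CR LF pair, and each CR not immediately followed by LF,
// with a single new-line character. A lone LF is passed through unchanged,
// on the assumption that it already represents new-line.
std::u32string normalize_new_lines(std::u32string_view in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'\r') {
            // Consume the LF of a CR LF pair so the pair yields one new-line.
            if (i + 1 < in.size() && in[i + 1] == U'\n') {
                ++i;
            }
            out.push_back(U'\n');
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Under these assumptions, a file that mixes CR LF, bare CR, and bare LF line endings comes out with one new-line character per line ending, which is the form that translation phases 2 and 3 expect.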

msg6961

Date: 2022-11-20.07:54:16

Proposed resolution [SUPERSEDED]:

Change in 5.2 [lex.phases] paragraph 1.1 as follows:

... If an input file is determined to be a UTF-8 file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of UCS scalar values that constitutes the sequence of elements of the translation character set, representing each line-termination character or character sequence as a new-line character.

For any other kind of input file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements (5.3.1 [lex.charset]) (~~introducing new-line characters for~~ representing end-of-line indicators as new-line characters).

msg6960

Date: 2022-11-27.21:00:25

P2720R0 comment US 3-030

[Accepted at the November, 2022 meeting.]

Translation phases 2 and 3 assume that lines are terminated by "new-line characters". However, the current specification of phase 1 does not guarantee that to be true. In particular, for a UTF-8 file the verbatim sequence of source file characters forms the input for phase 2, even on systems where the line terminator is a carriage return. The non-UTF-8 specification is also defective in that it speaks of "introducing" new-line characters, even for encodings like Latin-1 where new-lines might already be present and no "introduction" is needed or appropriate.

History
Date	User	Action	Args
2023-07-16 13:00:43	admin	set	status: open -> c++23
2023-07-16 13:00:43	admin	set	status: wp -> open
2023-02-18 18:43:04	admin	set	status: accepted -> wp
2022-11-25 05:14:04	admin	set	status: nb -> accepted
2022-11-20 07:54:16	admin	set	messages: + msg7053
2022-11-20 07:54:16	admin	set	status: open -> nb
2022-11-07 07:40:31	admin	set	messages: + msg6961
2022-11-03 00:00:00	admin	create