Issue 2331: regex_constants::collate's effects are inaccurately summarized

Title: regex_constants::collate's effects are inaccurately summarized
Status: open
Section: [re.synopt]
Submitter: Stephan T. Lavavej

Created on 2013-09-21.00:00:00, last changed 123 months ago

Messages

Date: 2015-05-08.19:54:16

Proposed resolution:

This wording is relative to N3691.

In [re.synopt]/1, Table 138 — "syntax_option_type effects", change as indicated:

Table 138 — `syntax_option_type` effects
Element	Effect(s) if set
`…`
`collate`	Specifies that character ~~ranges of the form "`[a-b]`"~~ comparisons and character range comparisons shall be locale sensitive.
`…`

msg6849

Date: 2014-05-15.00:00:00

[ 2014-5-14, John Maddock response ]

The original intention matched the original wording: namely, that collate only made character ranges locale sensitive. To be frank, it's a feature that's probably hardly ever used (though I have no hard data on that), and is a leftover from early POSIX standards, which required locale-sensitive collation for character ranges and later changed that to implementation-defined, if I remember correctly (basically nobody implemented locale-dependent collation).

So I guess the question is: do we gain anything by requiring all character comparisons to go through the locale when this bit is set? Certainly it adds a great deal to the implementation effort (it's not what Boost.Regex has ever done). Are differing code points that collate identically an important use case? I guess there might be a few Unicode code points that do that, but I don't know how to go about verifying it.

STL:

If this was unintentional, then [re.synopt]/1's table should be left alone, while [re.grammar]/14 should be changed instead.
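
For readers weighing the two options, the sketch below shows the two std::regex_traits hooks that a locale-sensitive comparison would plausibly go through. It assumes that single-character equality is decided via translate() and range membership via sort keys from transform(); that mapping is a reading of [re.grammar], not a quotation of it. The helper names collate_equal and collate_in_range are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper (not a standard function): equality as it could be
// decided when collate is set, by comparing translate(c) with translate(d).
bool collate_equal(const std::regex_traits<char>& t, char c, char d) {
    return t.translate(c) == t.translate(d);
}

// Hypothetical helper: range membership decided on the locale's sort keys,
// i.e. transform(lo) <= transform(c) <= transform(hi).
bool collate_in_range(const std::regex_traits<char>& t,
                      char lo, char c, char hi) {
    const std::string l(1, lo), m(1, c), h(1, hi);
    const std::string kl = t.transform(l.begin(), l.end());
    const std::string km = t.transform(m.begin(), m.end());
    const std::string kh = t.transform(h.begin(), h.end());
    return kl <= km && km <= kh;
}
```

With a traits object imbued with std::locale::classic(), both helpers agree with plain code point comparison. The two readings of collate can only diverge under a locale whose collation tailors equality or ordering.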

Jeffrey Yasskin:

This page mentions that [V] in Swedish should match "W" in a perfect world.

However, the most recent version of TR18 retracts both language-specific loose matches and language-specific ranges because "for most full-featured regular expression engines, it is quite difficult to match under code point equivalences that are not 1:1" and "tailored ranges can be quite difficult to implement properly, and can have very unexpected results in practice. For example, languages may also vary whether they consider lowercase below uppercase or the reverse. This can have some surprising results: [a-Z] may not match anything if Z < a in that locale."
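
The "[a-Z]" surprise quoted from TR18 has a close analogue in std::regex that can be checked without any tailored locale. In the classic locale on an ASCII platform, 'Z' sorts before 'a', and a range whose end sorts before its start is reported as error_range at construction time rather than silently matching nothing. This is a minimal sketch; the helper name is_rejected_as_bad_range is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns true if compiling `pattern` fails with
// regex_constants::error_range, i.e. a bracket range whose end point
// sorts before its start point.
bool is_rejected_as_bad_range(
        const std::string& pattern,
        std::regex::flag_type flags = std::regex::ECMAScript) {
    try {
        std::regex re(pattern, flags);
        (void)re;  // only construction matters here
    } catch (const std::regex_error& e) {
        return e.code() == std::regex_constants::error_range;
    }
    return false;  // the pattern compiled without error
}
```

Whether a given range is valid therefore depends on the ordering in effect when the pattern is compiled, which is the portability hazard TR18 describes.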

ECMAScript doesn't include collation at all.

IMO, +1 to changing 28.13 instead of 28.5.1. It seems like we'd be on fairly solid ground if we wanted to remove regex_constants::collate entirely, in favor of named character classes, but of course that's not for this issue.

msg6730

Date: 2014-02-12.00:00:00

[ 2014-02-12 Issaquah : recategorize as P3 ]

Marshall Clow: 28.13/14 only applies to ECMAScript

All: we're unsure

Jonathan Wakely: we should ask John Maddock

Move to P3

msg6729

Date: 2013-09-21.00:00:00

The table in [re.synopt]/1 says that regex_constants::collate "Specifies that character ranges of the form "[a-b]" shall be locale sensitive.", but [re.grammar]/14 says that it affects individual character comparisons too.
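
To make the discrepancy concrete, here is a minimal sketch of how the flag is set. It compiles a pattern with regex_constants::collate under a chosen locale and tests both a range and a single literal character. The helper name matches_with_collate is hypothetical, and the classic locale is used as the default only so that the results do not depend on which locales are installed.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: compiles `pattern` as ECMAScript with the collate
// flag under `loc`, then reports whether the whole of `text` matches.
bool matches_with_collate(const std::string& pattern,
                          const std::string& text,
                          const std::locale& loc = std::locale::classic()) {
    std::regex re;
    re.imbue(loc);  // imbue before assign() so compilation sees the locale
    re.assign(pattern, std::regex::ECMAScript | std::regex::collate);
    return std::regex_match(text, re);
}
```

Under the classic locale, the range "[a-c]" and the literal "b" behave exactly as they would without the flag. The open question in this issue is which of these two kinds of comparison must consult the locale once a tailored one is imbued.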

History
Date	User	Action	Args
2015-05-08 19:54:16	admin	set	messages: + msg7409
2014-02-13 21:00:33	admin	set	messages: + msg6849
2014-02-13 21:00:33	admin	set	status: new -> open
2013-10-12 19:19:09	admin	set	messages: + msg6730
2013-09-21 00:00:00	admin	create