Title
regex_constants::collate's effects are inaccurately summarized
Status
open
Section
[re.synopt]
Submitter
Stephan T. Lavavej

Created on 2013-09-21.00:00:00 last changed 108 months ago

Messages

Date: 2015-05-08.19:54:16

Proposed resolution:

This wording is relative to N3691.

  1. In [re.synopt]/1, Table 138 — "syntax_option_type effects", change as indicated:

    Table 138 — syntax_option_type effects
    Element Effect(s) if set
    collate (current wording) Specifies that character ranges of the form "[a-b]" shall be locale sensitive.
    collate (proposed wording) Specifies that character comparisons and character range comparisons shall be locale sensitive.

Date: 2014-05-15.00:00:00

[ 2014-5-14, John Maddock response ]

The original intention was the original wording: namely that collate only made character ranges locale sensitive. To be frank it's a feature that's probably hardly ever used (though I have no real hard data on that), and is a leftover from early POSIX standards which required locale sensitive collation for character ranges, and then later changed to implementation defined if I remember correctly (basically nobody implemented locale-dependent collation).

So I guess the question is do we gain anything by requiring all character-comparisons to go through the locale when this bit is set? Certainly it adds a great deal to the implementation effort (it's not what Boost.Regex has ever done). I guess the question is are differing code-points that collate identically an important use case? I guess there might be a few Unicode code points that do that, but I don't know how to go about verifying that.

STL:

If this was unintentional, then [re.synopt]/1's table should be left alone, while [re.grammar]/14 should be changed instead.

Jeffrey Yasskin:

This page mentions that [V] in Swedish should match "W" in a perfect world.

However, the most recent version of TR18 retracts both language-specific loose matches and language-specific ranges because "for most full-featured regular expression engines, it is quite difficult to match under code point equivalences that are not 1:1" and "tailored ranges can be quite difficult to implement properly, and can have very unexpected results in practice. For example, languages may also vary whether they consider lowercase below uppercase or the reverse. This can have some surprising results: [a-Z] may not match anything if Z < a in that locale."

ECMAScript doesn't include collation at all.

IMO, +1 to changing 28.13 [re.grammar] instead of 28.5.1 [re.synopt]. It seems like we'd be on fairly solid ground if we wanted to remove regex_constants::collate entirely, in favor of named character classes, but of course that's not for this issue.
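
Editorial illustration (not part of the original discussion): the TR18 remark about "[a-Z]" can be checked for a given locale by comparing the two range endpoints with that locale's std::collate facet. The sketch below is a minimal example under stated assumptions: range_is_empty and self_check are hypothetical helper names, and only the classic "C" locale with an ASCII-compatible encoding is exercised, because the results for named locales depend on the platform.

```cpp
#include <cassert>
#include <locale>

// Hypothetical helper (not a standard library function): reports whether
// `lo` sorts after `hi` under `loc`, in which case a range written as
// [lo-hi] has no members under that locale's collation order.
bool range_is_empty(const std::locale& loc, char lo, char hi) {
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    // collate::compare returns a positive value when the first sequence
    // sorts after the second one.
    return coll.compare(&lo, &lo + 1, &hi, &hi + 1) > 0;
}

// Small check against the classic "C" locale, assuming an ASCII-compatible
// execution character set, where 'Z' (0x5A) sorts before 'a' (0x61).
void self_check() {
    const std::locale& c_locale = std::locale::classic();
    assert(range_is_empty(c_locale, 'a', 'Z'));   // [a-Z] is empty here
    assert(!range_is_empty(c_locale, 'A', 'z'));  // [A-z] is not
}
```

Note that an implementation may reject a reversed range when the expression is compiled (regex_constants::error_range exists for that case) rather than build an expression that matches nothing, so the helper only shows the ordering question, not what a particular std::regex implementation does with it.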

Date: 2014-02-12.00:00:00

[ 2014-02-12 Issaquah : recategorize as P3 ]

Marshall Clow: 28.13/14 only applies to ECMAScript

All: we're unsure

Jonathan Wakely: we should ask John Maddock

Move to P3

Date: 2013-09-21.00:00:00

The table in [re.synopt]/1 says that regex_constants::collate "Specifies that character ranges of the form "[a-b]" shall be locale sensitive.", but [re.grammar]/14 says that it affects individual character comparisons too.
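
Editorial illustration (not part of the submitted issue): the two behaviors being contrasted can be sketched with std::regex_traits. The code below is a paraphrase of the comparison rules as commonly read from [re.grammar], not a quotation of the standard; the exact wording and paragraph numbers should be checked against the working draft. The helper names (sort_key, in_collated_range, collated_equal, regex_range_match, self_check) are hypothetical, and the checks assume the classic "C" locale.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: locale-derived sort key for a single character.
std::string sort_key(const std::regex_traits<char>& traits, char c) {
    const std::string s(1, traits.translate(c));
    return traits.transform(s.begin(), s.end());
}

// Range comparison in the style described for collate: compare sort keys
// obtained from the traits object instead of raw character values.
bool in_collated_range(const std::regex_traits<char>& traits,
                       char lo, char hi, char c) {
    const std::string key = sort_key(traits, c);
    return sort_key(traits, lo) <= key && key <= sort_key(traits, hi);
}

// Single-character comparison when collate is set and icase is not:
// both characters go through traits.translate before being compared.
bool collated_equal(const std::regex_traits<char>& traits, char a, char b) {
    return traits.translate(a) == traits.translate(b);
}

// The same range test performed by the library, with the flag set.
bool regex_range_match(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::collate);
    return std::regex_match(text, re);
}

// Small check under the classic "C" locale.
void self_check() {
    std::regex_traits<char> traits;
    traits.imbue(std::locale::classic());
    assert(in_collated_range(traits, 'a', 'c', 'b'));
    assert(!in_collated_range(traits, 'a', 'c', 'd'));
    assert(collated_equal(traits, 'x', 'x'));
    assert(regex_range_match("[a-c]", "b"));
}
```

In the "C" locale both paths agree with plain code-point comparison, so the flag makes no visible difference there; any divergence between the range rule and the single-character rule would only show up with a locale or traits class whose translate or transform is not the identity.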

History
Date User Action Args
2015-05-08 19:54:16 admin set messages: + msg7409
2014-02-13 21:00:33 admin set messages: + msg6849
2014-02-13 21:00:33 admin set status: new -> open
2013-10-12 19:19:09 admin set messages: + msg6730
2013-09-21 00:00:00 admin create