Issue 2584: <regex> ECMAScript IdentityEscape is ambiguous

Title: <regex> ECMAScript IdentityEscape is ambiguous
Status: c++17
Section: [re.grammar]
Submitter: Billy O'Neal III

Created on 2016-01-13.00:00:00, last changed 2017-07-30.20:15:43

Messages

Date: 2016-08-02.17:19:11

Proposed resolution:

This wording is relative to N4567.

Change [re.grammar]/3 as indicated:

-3- The following productions within the ECMAScript grammar are modified as follows:
ClassAtom ::
  -
  ClassAtomNoDash
  ClassAtomExClass
  ClassAtomCollatingElement
  ClassAtomEquivalence
  
IdentityEscape ::
  SourceCharacter but not c

msg7686

Date: 2016-08-02.17:19:11

[ 2016-08, Chicago ]

Monday PM: Move to tentatively ready

msg7685

Date: 2016-01-13.00:00:00

Stephan and I are seeing implementations differ in how they handle non-special characters in the IdentityEscape part of the ECMAScript grammar. For example:

#include <stdio.h>
#include <iostream>
#ifdef USE_BOOST
#include <boost/regex.hpp>
using namespace boost;
#else
#include <regex>
#endif
using namespace std;

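int main() {
  try {
    // Construction throws regex_error if the implementation rejects \z.
    const regex r("\\z");
    cout << "Constructed \\z." << endl;
    // If accepted, \z should behave as an identity escape for the letter z.
    if (regex_match("z", r))
      cout << "Matches z" << endl;
  } catch (const regex_error& e) {
    cout << e.what() << endl;
  }
}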

libstdc++, Boost, and the browsers I tested (Microsoft Edge, Google Chrome) all happily interpret \z, which otherwise has no meaning, as an identity escape for the letter z. libc++ and msvc++ say that this is invalid and throw regex_error with error_escape.
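
To compare implementations without reading console output, the check can be reduced to a yes/no probe. This is a minimal sketch; accepts_pattern is a hypothetical helper name, not part of any library, and it uses only the std::regex constructor and std::regex_error.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports whether std::regex accepts `pattern`
// under the default ECMAScript grammar.
bool accepts_pattern(const std::string& pattern) {
    try {
        const std::regex r(pattern);  // throws std::regex_error if rejected
        (void)r;                      // the object is only needed for construction
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Calling accepts_pattern("\\z") would return true on an implementation that treats \z as an identity escape and false on one that throws error_escape.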

ECMAScript 3 (which is what C++ currently points to) seems to agree with libc++ and msvc++:

IdentityEscape ::
  SourceCharacter but not IdentifierPart

IdentifierPart ::
  IdentifierStart
  UnicodeCombiningMark
  UnicodeDigit
  UnicodeConnectorPunctuation
  \ UnicodeEscapeSequence

IdentifierStart ::
  UnicodeLetter
  $
  _
  \ UnicodeEscapeSequence
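
As a rough illustration, the ECMAScript 3 productions above can be sketched as a predicate. This is an ASCII-only approximation: the Unicode letter, digit, combining-mark, and connector-punctuation categories are reduced to their ASCII members, and both function names are hypothetical.

```cpp
// ASCII-only approximation of ES3 IdentifierPart (assumption: non-ASCII
// Unicode categories are ignored). In ASCII this is letters, digits, $ and _.
constexpr bool is_ascii_identifier_part(char ch) {
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
           (ch >= '0' && ch <= '9') || ch == '$' || ch == '_';
}

// ES3: IdentityEscape :: SourceCharacter but not IdentifierPart
constexpr bool is_es3_identity_escape(char ch) {
    return !is_ascii_identifier_part(ch);
}
```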

But this doesn't make any sense: it prohibits things like \$, which users absolutely need to be able to escape. So let's look at ECMAScript 6. I believe this says much the same thing, but updates the spec to better handle Unicode by referencing what the Unicode standard says is an identifier character:

IdentityEscape ::
  SyntaxCharacter
  /
  SourceCharacter but not UnicodeIDContinue
  
UnicodeIDContinue ::
  any Unicode code point with the Unicode property "ID_Continue", "Other_ID_Continue", or "Other_ID_Start"
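
For comparison, the ECMAScript 6 productions as quoted above can be sketched the same way. This is again an ASCII-only approximation: within ASCII, the ID_Continue property is assumed to cover letters, digits, and underscore, and the SyntaxCharacter set is taken from the ECMAScript 6 pattern grammar. All function names are hypothetical.

```cpp
#include <cstring>

// ES6 SyntaxCharacter :: one of ^ $ \ . * + ? ( ) [ ] { } |
bool is_syntax_character(char ch) {
    // Guard against '\0', which strchr would match at the terminator.
    return ch != '\0' && std::strchr("^$\\.*+?()[]{}|", ch) != nullptr;
}

// ASCII-only approximation of UnicodeIDContinue (assumption: non-ASCII
// code points are ignored).
bool is_ascii_id_continue(char ch) {
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
           (ch >= '0' && ch <= '9') || ch == '_';
}

// ES6: IdentityEscape :: SyntaxCharacter | / | SourceCharacter but not UnicodeIDContinue
bool is_es6_identity_escape(char ch) {
    return is_syntax_character(ch) || ch == '/' || !is_ascii_id_continue(ch);
}
```

Under this approximation \$ and \/ are identity escapes, while \z and \_ are not.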

However, ECMAScript 6 has an appendix B defining "additional features for web browsers" which says:

IdentityEscape ::
  SourceCharacter but not c

which appears to agree with what libstdc++, boost, and browsers are doing.
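
The appendix B rule is simple enough to state as a one-line predicate. This is a sketch with a hypothetical function name. It models only the IdentityEscape production, so characters that other escape productions claim first (for example \d or \n) are outside its scope.

```cpp
// Appendix B: IdentityEscape :: SourceCharacter but not c
// The letter c is excluded because \c introduces a control-letter escape.
constexpr bool is_appendix_b_identity_escape(char ch) {
    return ch != 'c';
}
```

With this rule both \z and \$ qualify as identity escapes.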

What should be the correct behavior here?

History
Date	User	Action	Args
2017-07-30 20:15:43	admin	set	status: wp -> c++17
2016-11-14 03:59:28	admin	set	status: pending -> wp
2016-11-14 03:55:22	admin	set	status: ready -> pending
2016-08-02 17:19:11	admin	set	messages: + msg8332
2016-08-02 17:19:11	admin	set	status: new -> ready
2016-01-16 21:32:44	admin	set	messages: + msg7686
2016-01-13 00:00:00	admin	create