Title
Implementability of locale-sensitive UnicodeEscapeSequence matching
Status
new
Section
[re.grammar]
Submitter
Hubert Tong

Created on 2015-10-08.00:00:00 last changed 2 months ago

Messages

Date: 2024-10-15.00:00:00

[ 2024-10-03; Jonathan comments ]

std::basic_regex<charT> only properly supports matching single code units that fit in `charT`. There's nothing in the spec that supports matching code points that require multiple code units, let alone checking whether a character in an arbitrary encoding corresponds to any given Unicode code point. [re.grammar] paragraph 12 appears to be an attempt to allow implementations to fail to match here, but is insufficient. When is_unsigned_v<char> is true, the CV of the UnicodeEscapeSequence `"\u0080"` is not greater than `CHAR_MAX`, but that doesn't help because U+0080 is encoded as two bytes in UTF-8. Being able to represent `0x80` as `char` does not mean the CV can be matched as a single `char`. The API is unsuitable for Unicode-aware strings.
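
The U+0080 case can be checked with a small sketch. The names `encode_utf8` and `fits_in_byte_but_needs_multiple_units` are hypothetical helpers introduced here for illustration; they are not part of `<regex>` or of any proposed resolution. The sketch assumes the target encoding is UTF-8.

```cpp
#include <cassert>
#include <climits>
#include <string>

// Hypothetical helper (not part of <regex>): encode one Unicode scalar value as UTF-8.
std::string encode_utf8(char32_t cp) {
    // Only Unicode scalar values are accepted (no surrogates, nothing above U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 code unit
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 code units
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 code units
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 code units
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True when the value is representable in one byte, yet its UTF-8 encoding
// needs more than one code unit, so it cannot be matched as a single char.
bool fits_in_byte_but_needs_multiple_units(char32_t cp) {
    return cp <= UCHAR_MAX && encode_utf8(cp).size() > 1;
}
```

Under these assumptions, `encode_utf8(0x80)` yields the two bytes `0xC2 0x80`, so a CV of `0x80` passes a "not greater than `CHAR_MAX`" test on an unsigned-`char` platform while still not corresponding to any single `char` in the subject string.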

Date: 2015-10-13.18:22:47

In [re.grammar] paragraph 2:

basic_regex member functions shall not call any locale dependent C or C++ API, including the formatted string input functions. Instead they shall call the appropriate traits member function to achieve the required effect.

Yet the required interface for a regular expression traits class ([re.req]) does not appear to provide any reliable way to determine whether a character, as encoded for the locale associated with the traits instance, is the same as the character represented by a UnicodeEscapeSequence. For example, assuming a sane ru_RU.koi8r locale:

#include <stdio.h>
#include <stdlib.h>
#include <locale>
#include <regex>
#include <string>

// In KOI8-R, the single byte 0xB3 encodes U+0401 (CYRILLIC CAPITAL LETTER IO).
const char data[] = "\xB3";
const char matchCyrillicCapitalLetterYo[] = R"(\u0401)";

int main()
{
  try {
    std::regex myRegex;
    myRegex.imbue(std::locale("ru_RU.koi8r"));

    // Case 1: match the character through a UnicodeEscapeSequence.
    myRegex.assign(matchCyrillicCapitalLetterYo, std::regex_constants::ECMAScript);
    printf("(%s)\n", std::regex_replace(std::string(data), myRegex, std::string("E")).c_str());

    // Case 2: match the same character through a locale-sensitive character class.
    myRegex.assign("[[:alpha:]]", std::regex_constants::ECMAScript);
    printf("(%s)\n", std::regex_replace(std::string(data), myRegex, std::string("E")).c_str());
  } catch (const std::regex_error&) {
    abort();
  }
  return 0;
}

The implementation I tried prints:

(Ё)
(E)

This means that the character class matched the character, but the UnicodeEscapeSequence did not.
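
To make the missing capability concrete, the following is a minimal sketch of the kind of query that [re.req] does not offer. Everything here is hypothetical: `koi8r_to_code_point` and `matches_code_point` are illustrative names, not members of any traits class, and the table is deliberately partial, covering only ASCII and the two KOI8-R bytes for the letter Yo. It assumes C++17 for `std::optional`.

```cpp
#include <optional>

// Hypothetical, partial KOI8-R decoder. A real implementation would need the
// full table for every encoding a locale might use, which is the crux of the issue.
std::optional<char32_t> koi8r_to_code_point(unsigned char byte) {
    if (byte < 0x80) {
        return static_cast<char32_t>(byte);  // KOI8-R agrees with ASCII below 0x80
    }
    switch (byte) {
        case 0xA3: return char32_t{0x0451};  // CYRILLIC SMALL LETTER IO
        case 0xB3: return char32_t{0x0401};  // CYRILLIC CAPITAL LETTER IO
        default:   return std::nullopt;      // not covered by this sketch
    }
}

// Hypothetical query: does the locale-encoded character `c` denote the code
// point `cv` taken from a UnicodeEscapeSequence?
bool matches_code_point(char c, char32_t cv) {
    const auto cp = koi8r_to_code_point(static_cast<unsigned char>(c));
    return cp.has_value() && *cp == cv;
}
```

With such a query, `matches_code_point('\xB3', 0x0401)` would be true, which is the comparison the first `regex_replace` call in the example needs. Comparing the raw byte value `0xB3` against the CV `0x0401` cannot succeed.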

History
Date User Action Args
2024-10-03 19:38:47 admin set messages: + msg14421
2015-10-08 00:00:00 admin create