Created on 2015-10-08.00:00:00 last changed 2 months ago
[ 2024-10-03; Jonathan comments ]
`std::basic_regex<charT>` only properly supports matching single code units that fit in `charT`.
There's nothing in the spec that supports matching code points that
require multiple code units, let alone checking whether a character
in an arbitrary encoding corresponds to any given Unicode code point.
[re.grammar] paragraph 12 appears to be an attempt to
allow implementations to fail to match here, but is insufficient.
When `is_unsigned_v<char>` is true, the CV of the UnicodeEscapeSequence `"\u0080"` is not greater than `CHAR_MAX`,
but that doesn't help because U+0080 is encoded as two bytes in UTF-8.
Being able to represent `0x80` as `char` does not mean the CV can be
matched as a single `char`.
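The gap between "representable as `char`" and "matchable as a single `char`" can be illustrated with a small sketch. The helper `encode_utf8` below is hypothetical (it is not part of `<regex>` or of any traits interface) and assumes the subject string is UTF-8 encoded:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of <regex>: encode one Unicode code point
// as UTF-8. Surrogates and out-of-range values are not validated here.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // One code unit: ASCII range.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two code units: U+0080 through U+07FF.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three code units.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four code units.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Under that assumption, `encode_utf8(0x80)` yields the two bytes `0xC2 0x80`, so a matcher that compares one `char` at a time against the value `0x80` cannot match U+0080 in a UTF-8 string, even on a platform where `char` is unsigned.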
The API is unsuitable for Unicode-aware strings.
In [re.grammar] paragraph 2:
> `basic_regex` member functions shall not call any locale dependent C or C++ API, including the formatted string input functions. Instead they shall call the appropriate traits member function to achieve the required effect.
Yet the required interface for a regular expression traits class ([re.req]) does not appear to provide any reliable way to determine whether a character, as encoded for the locale associated with the traits instance, is the same as the character represented by a UnicodeEscapeSequence. For example, assuming a sane `ru_RU.koi8r` locale:
    #include <stdio.h>
    #include <stdlib.h>
    #include <regex>

    const char data[] = "\xB3";
    const char matchCyrillicCaptialLetterYo[] = R"(\u0401)";

    int main(void) {
      try {
        std::regex myRegex;
        myRegex.imbue(std::locale("ru_RU.koi8r"));
        myRegex.assign(matchCyrillicCaptialLetterYo, std::regex_constants::ECMAScript);
        printf("(%s)\n", std::regex_replace(std::string(data), myRegex, std::string("E")).c_str());
        myRegex.assign("[[:alpha:]]", std::regex_constants::ECMAScript);
        printf("(%s)\n", std::regex_replace(std::string(data), myRegex, std::string("E")).c_str());
      } catch (std::regex_error& e) {
        abort();
      }
      return 0;
    }
The implementation I tried prints:
    (Ё)
    (E)
This means that the character class matched, but the UnicodeEscapeSequence did not.
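For illustration only, the kind of query that [re.req] does not offer might look like the sketch below. Both function names are hypothetical, and the table covers only the two KOI8-R bytes for the letter Yo plus the ASCII range; a real mapping would have to come from the locale's character set.

```cpp
#include <cassert>

// Sentinel for bytes this partial sketch does not map (assumption, not a
// standard value).
const char32_t kUnmapped = 0xFFFFFFFF;

// Hypothetical helper: map a KOI8-R encoded byte to its Unicode code point.
// Only ASCII and the two Yo letters are covered here.
char32_t koi8r_to_code_point(unsigned char byte) {
    if (byte < 0x80) {
        return byte;  // KOI8-R agrees with ASCII in the lower half.
    }
    switch (byte) {
        case 0xA3: return 0x0451;  // CYRILLIC SMALL LETTER IO
        case 0xB3: return 0x0401;  // CYRILLIC CAPITAL LETTER IO
        default:   return kUnmapped;
    }
}

// Hypothetical traits-style query: does the locale-encoded character c
// denote the code point cv given by a UnicodeEscapeSequence?
bool matches_unicode_escape(char c, char32_t cv) {
    return koi8r_to_code_point(static_cast<unsigned char>(c)) == cv;
}
```

With such a hook, `matches_unicode_escape('\xB3', 0x0401)` would be true, which is the comparison the `\u0401` pattern needs; comparing the raw byte value `0xB3` with `0x0401` can never succeed.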
History

Date | User | Action | Args
---|---|---|---
2024-10-03 19:38:47 | admin | set | messages: + msg14421
2015-10-08 00:00:00 | admin | create |