Title
regex_iterator and join_view don't work together very well
Status
new
Section
[re.iter][range.join]
Submitter
Barry Revzin

Created on 2022-05-12.00:00:00 last changed 1 month ago

Messages

Date: 2022-05-15.00:00:00

[ 2022-05-17; Reflector poll ]

Set priority to 2 after reflector poll.

Date: 2022-05-12.00:00:00

Consider this example (from StackOverflow):

#include <ranges>
#include <regex>
#include <iostream>

int main() {
  char const text[] = "Hello";
  std::regex regex{"[a-z]"};

  auto lower = std::ranges::subrange(
        std::cregex_iterator(
            std::ranges::begin(text),
            std::ranges::end(text),
            regex),
        std::cregex_iterator{}
    )
    | std::views::join
    | std::views::transform([](auto const& sm) {
        return std::string_view(sm.first, sm.second);
    });

  for (auto const& sv : lower) {
    std::cout << sv << '\n';
  }
}

This example seems sound, having lower be a range of string_view that should refer back into text, which is in scope for all this time. The std::regex object is also in scope for all this time.

Yet, if you run this through address sanitizer, it blows up in the first call to the dereference operator of the underlying transform_view's iterator with heap-use-after-free.

The problem here is ultimately that regex_iterator is a stashing iterator (it has a member match_results) yet advertises itself as a forward_iterator (despite violating [forward.iterators] p6 and [iterator.concept.forward] p3).
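The mismatch can be observed directly. The following is a minimal sketch, not part of the original report: equal_but_distinct is a hypothetical helper name, and the check assumes the usual implementation in which operator* returns a reference to the iterator's own match_results member. It relies on the forward iterator requirement that, roughly, two equal dereferenceable iterators refer to the same object.

```cpp
#include <cassert>
#include <iterator>
#include <memory>
#include <regex>
#include <type_traits>

// regex_iterator advertises itself as a forward iterator...
static_assert(std::is_same_v<std::cregex_iterator::iterator_category,
                             std::forward_iterator_tag>);

// ...but a copy dereferences to a different object than its source,
// because each iterator carries its own match_results.
// (Hypothetical helper for illustration; not a library function.)
inline bool equal_but_distinct(const char* first, const char* last,
                               const std::regex& re) {
    std::cregex_iterator a(first, last, re);
    if (a == std::cregex_iterator{}) {
        return false;  // no match at all, nothing to dereference
    }
    std::cregex_iterator b = a;  // copy of a dereferenceable iterator
    assert(a == b);              // the copy compares equal to its source
    // A true forward iterator would yield the same object here.
    return std::addressof(*a) != std::addressof(*b);
}
```

Called with the text and regex from the example above, this is expected to return true, which is the behavior of an input iterator rather than a forward iterator.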

Then, join_view's iterator stores an outer iterator (the regex_iterator) and an inner iterator (an iterator into the container that the regex_iterator stashes). Copying that iterator effectively invalidates it, since the new iterator's inner iterator will refer to the old iterator's outer iterator's container. These aren't (and can't be) independent copies. In this particular example, join_view's begin iterator is copied into the transform_view's iterator, and then the original (which owns the container that the new inner iterator still points to) is destroyed, which leaves us with a dangling iterator.
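The copy problem can be illustrated without join_view itself. The sketch below uses a hypothetical struct, JoinLikeIterator, as a simplified stand-in for an iterator that holds an outer regex_iterator plus a position inside the stashed match_results; it is not how any standard library actually implements join_view. A plain pointer plays the role of the inner iterator so that the comparison stays well-defined.

```cpp
#include <cassert>
#include <regex>

// Simplified stand-in for join_view's iterator (hypothetical type).
struct JoinLikeIterator {
    std::cregex_iterator outer;    // stashes a match_results
    const std::csub_match* inner;  // position inside outer's match_results
};

// Returns true if, after copying, the copy's inner position still refers
// to the ORIGINAL's storage rather than to the copy's own storage.
inline bool copy_inner_points_into_original(const char* first,
                                            const char* last,
                                            const std::regex& re) {
    JoinLikeIterator original{std::cregex_iterator(first, last, re), nullptr};
    if (original.outer == std::cregex_iterator{}) {
        return false;  // no match, nothing to point at
    }
    original.inner = &(*original.outer)[0];

    JoinLikeIterator copy = original;  // memberwise copy
    assert(copy.inner == original.inner);  // pointer was copied verbatim

    // The copy's own match_results lives elsewhere.
    return copy.inner != &(*copy.outer)[0];
}
```

Once original is destroyed, copy.inner in this sketch would dangle, which mirrors the heap-use-after-free described above.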

Note that the example runs without error on libc++ because libc++ moves the iterator instead of copying it, which happens to work. But I can produce other examples, unrelated to transform_view, that fail.

This is actually two different problems:

  1. regex_iterator is really an input iterator, not a forward iterator. It does not meet either the C++17 or the C++20 forward iterator requirements.

  2. join_view can't handle stashing iterators, and would need to additionally store the outer iterator in a non-propagating-cache for input ranges (similar to how it already potentially stores the inner iterator in a non-propagating-cache).

(So potentially this could be two different LWG issues, but it seems nicer to think of them together.)
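For readers unfamiliar with the non-propagating-cache mentioned in point 2, the following is a rough sketch of the idea only. The class name nonprop_cache_sketch and the helper copy_starts_empty are hypothetical, and the standard's exposition-only type has more to it than this; the sketch is not a proposed resolution. The key property is that copying the cache does not copy the cached value, so a copied view or iterator never inherits state tied to the original.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical, simplified cache: the stored value never propagates
// through copy or move.
template <class T>
class nonprop_cache_sketch {
    std::optional<T> value_;

public:
    nonprop_cache_sketch() = default;

    // Copy construction leaves the new cache empty.
    nonprop_cache_sketch(const nonprop_cache_sketch&) noexcept {}

    // Move construction leaves both caches empty.
    nonprop_cache_sketch(nonprop_cache_sketch&& other) noexcept {
        other.value_.reset();
    }

    nonprop_cache_sketch& operator=(const nonprop_cache_sketch& other) noexcept {
        if (std::addressof(other) != this) {
            value_.reset();
        }
        return *this;
    }

    nonprop_cache_sketch& operator=(nonprop_cache_sketch&& other) noexcept {
        value_.reset();
        other.value_.reset();
        return *this;
    }

    template <class... Args>
    T& emplace(Args&&... args) {
        return value_.emplace(std::forward<Args>(args)...);
    }

    bool has_value() const noexcept { return value_.has_value(); }
    T& operator*() { return *value_; }
};

// Small check of the defining property (hypothetical helper).
inline bool copy_starts_empty() {
    nonprop_cache_sketch<int> a;
    a.emplace(42);
    nonprop_cache_sketch<int> b = a;  // copy does not carry the value
    assert(a.has_value());            // the source keeps its value
    return !b.has_value();
}
```

Storing the outer iterator in such a cache would mean that a copy has to re-establish its own outer and inner positions instead of silently pointing into the original.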

History
Date                 User   Action  Args
2022-05-17 11:58:16  admin  set     messages: + msg12469
2022-05-12 00:00:00  admin  create