views::split drops trailing empty range
Barry Revzin

Created on 2020-08-20.00:00:00 last changed 3 weeks ago


Date: 2020-09-15.00:00:00

[ 2020-09-02; Reflector prioritization ]

Set priority to 2 as a result of reflector discussions.

Date: 2020-08-24.13:28:26

From StackOverflow, the program:

#include <iostream>
#include <string>
#include <ranges>

int main() {
  std::string s = " text ";
  auto sv = std::ranges::views::split(s, ' ');
  std::cout << std::ranges::distance(sv.begin(), sv.end());
}

prints 2 (as specified), but it really should print 3. If a range has N delimiters in it, splitting should produce N+1 pieces. If the Nth delimiter is the last element in the input range, views::split produces only N pieces: it doesn't emit a trailing empty range.
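To make the N+1 expectation concrete, here is a minimal sketch of an eager splitter that keeps every empty piece. The name split_keep_empty is hypothetical and is not part of the standard library or of any proposed wording; it only illustrates the counting rule.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (an illustration, not a standard facility):
// splits `input` on every occurrence of `delim` and keeps empty pieces,
// so N delimiters always yield N + 1 pieces.
std::vector<std::string> split_keep_empty(std::string_view input, char delim) {
    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = input.find(delim, start);
        if (pos == std::string_view::npos) {
            // No delimiter left: the remainder (possibly empty) is the last piece.
            pieces.emplace_back(input.substr(start));
            return pieces;
        }
        pieces.emplace_back(input.substr(start, pos - start));
        start = pos + 1;  // continue just past the delimiter
    }
}
```

Under these assumptions, split_keep_empty(" text ", ' ') yields the three pieces "", "text", and "", which is the count the program above is expected to print.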

Surveying a number of languages gives a sense of what they do here. There are basically two groups (Haskell appears in both because it has several different split functions):

  1. Rust, Python, JavaScript, Go, Kotlin, and Haskell's "splitOn" all provide N+1 parts if there are N delimiters.

  2. APL, D, Elixir, Haskell's "words", Ruby, and Clojure all discard every empty word. Splitting " x " on " " would give ["x"] here, whereas the languages in the first group would give ["", "x", ""].
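The second group's behavior can be sketched as follows. The helper name split_skip_empty is hypothetical, and the code only approximates what those languages do on this simple single-character case; it is not taken from any of them.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (an illustration only): splits `input` on `delim`
// and discards every empty piece, as the second group of languages does.
std::vector<std::string> split_skip_empty(std::string_view input, char delim) {
    std::vector<std::string> words;
    std::size_t start = 0;
    while (start <= input.size()) {
        std::size_t pos = input.find(delim, start);
        if (pos == std::string_view::npos) {
            pos = input.size();  // treat the end of input as a final boundary
        }
        if (pos > start) {
            // Only non-empty stretches between boundaries are kept.
            words.emplace_back(input.substr(start, pos - start));
        }
        start = pos + 1;
    }
    return words;
}
```

With this sketch, split_skip_empty(" x ", ' ') yields the single word "x", where a first-group splitter would report three pieces.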

Java is distinct from both groups: it mostly behaves like a first-category language, except that by default it removes all trailing empty strings (while keeping all leading and intermediate empty strings, unlike the second-category languages). It has a parameter that lets you keep the trailing ones too.

C++20's behavior is closest to Java's default, except that it removes only one trailing empty string instead of every trailing empty string, and this behavior is not parameterizable. But I think the intent is to be squarely in the first category, so I think the current behavior is just a specification error.
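The difference between dropping one trailing empty string and dropping all of them can be modeled on an already-split list of pieces. This is only a rough sketch: drop_trailing_empty and its max_dropped parameter are hypothetical names, and the model ignores edge cases in the real Java and C++20 facilities.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (an illustration only): removes up to `max_dropped`
// trailing empty strings from a list of pieces.
//   max_dropped == 1             roughly models the C++20 behavior described above
//   max_dropped == pieces.size() roughly models Java's default
//   max_dropped == 0             keeps all N + 1 pieces (first-category behavior)
std::vector<std::string> drop_trailing_empty(std::vector<std::string> pieces,
                                             std::size_t max_dropped) {
    while (max_dropped > 0 && !pieces.empty() && pieces.back().empty()) {
        pieces.pop_back();
        --max_dropped;
    }
    return pieces;
}
```

For the pieces "x", "", "" (the first-category result of splitting "x" followed by two spaces), a limit of 1 leaves "x", "" while a limit of 3 leaves only "x". Leading and intermediate empty pieces are untouched in both cases.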

Many of these languages also provide an additional parameter to limit how many splits happen (e.g. Java, Kotlin, Python, Rust, JavaScript), but that's a separate design question.
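For reference, a split limit could look like the following sketch. The name split_at_most and the max_splits parameter are hypothetical; the semantics here follow Python's maxsplit (a cap on the number of splits, with the remainder left unsplit), and other languages define their limits differently.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (an illustration only): performs at most `max_splits`
// splits on `delim`; whatever remains after the last split is returned
// unsplit as the final piece.
std::vector<std::string> split_at_most(std::string_view input, char delim,
                                       std::size_t max_splits) {
    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (pieces.size() < max_splits) {
        const std::size_t pos = input.find(delim, start);
        if (pos == std::string_view::npos) {
            break;  // fewer delimiters than the limit allows
        }
        pieces.emplace_back(input.substr(start, pos - start));
        start = pos + 1;
    }
    pieces.emplace_back(input.substr(start));  // remainder, possibly empty
    return pieces;
}
```

With this sketch, split_at_most("a,b,c,d", ',', 2) yields "a", "b", and "c,d".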

Date                 User   Action  Args
2020-09-02 17:46:04  admin  set     messages: + msg11469
2020-08-20 00:00:00  admin  create