Issue 2265: 29.3p9 appears to rule out some acceptable executions

Title: 29.3p9 appears to rule out some acceptable executions
Status: open
Section: [atomics.order]
Submitter: Brian Demsky

Created on 2013-06-17.00:00:00 last changed 134 months ago

Messages

Date: 2015-05-08.04:23:26

[ 2015-05 Lenexa, SG1 response ]

This was partially addressed (weasel-worded) in C++14 (See N3786). The remainder is an open research problem. N3710 outlines a "solution" that doesn't have a consensus behind it because it costs performance. We have no better solution at the moment.

msg7330 (view)

Date: 2015-04-04.16:45:17

[ 2015-02 Cologne ]

Handed over to SG1.

msg6539 (view)

Date: 2013-06-17.00:00:00

I believe that the following variation on IRIW should admit executions in which c1 = d1 = 5 and c2 = d2 = 0. If this is allowed, then what is sequence of program evaluations for [atomics.order] p9 that justifies the store to z? It seems that [atomics.order] p9 should not allow this execution because one of the stores to x or y has to appear earlier in the sequence, each of the fetch_adds reads the previous load in the thread (and thus must appear later in the sequence), and [atomics.order] p9 states that each load must read from the last prior assignment in the sequence.

atomic_int x;
atomic_int y;
atomic_int z;
int c1, c2, d1, d2;

static void a(void* obj)
{
  atomic_store_explicit(&x, 5, memory_order_relaxed); 
}

static void b(void* obj)
{
  atomic_store_explicit(&y, 5, memory_order_relaxed); 
}

static void c(void* obj)
{
  c1 = atomic_load_explicit(&x, memory_order_relaxed);
  // this could also be an atomic load if the address depends on c1:
  c2 = atomic_fetch_add_explicit(&y, c1, memory_order_relaxed);  
}

static void d(void* obj)
{
  d1 = atomic_load_explicit(&y, memory_order_relaxed);
  d2 = atomic_fetch_add_explicit(&x, d1, memory_order_relaxed); 
}

int user_main(int argc, char** argv)
{
  thrd_t t1, t2, t3, t4;

  atomic_init(&x, 0);
  atomic_init(&y, 0);

  printf("Main thread: creating 4 threads\n");
  thrd_create(&t1, (thrd_start_t)&a, NULL);
  thrd_create(&t2, (thrd_start_t)&b, NULL);
  thrd_create(&t3, (thrd_start_t)&c, NULL);
  thrd_create(&t4, (thrd_start_t)&d, NULL);

  thrd_join(t1);
  thrd_join(t2);
  thrd_join(t3);
  thrd_join(t4);
  printf("c1=%d c2=%d\n",c1,c2);
  printf("d1=%d d2=%d\n",d1,d2);

  // Can this store write 1000 (i.e., c1=d1=5, c2=d2=0)?
  atomic_store(&z, (c1+d1)*100+c2+d2);

  printf("Main thread is finished\n");

  return 0;
}

It seems that the easiest fix is to allow a load in [atomics.order] p9 to read from any prior store in the evaluation order.

That said, I would personally advocate the following: It seems to me that C/C++ atomics are in a bit of different situation than Java because:

People are expected to use relaxed C++ atomics in potentially racy situations, so it isn't clear that semantics as complicated as the JMM's causality would be sane.
People who use C/C++ atomics are likely to be experts and use them in a very controlled fashion. I would be really surprised if compilers would find any real wins by optimizing the use of atomics.

Why not do something like:

There is satisfaction DAG of all program evaluations. Each evaluation observes the values of variables as computed by some prior assignment in the DAG.

There is an edge x->y between two evaluations x and y if:

the evaluation y observes a value computed by the evaluation x or
the evaluation y is an atomic store, the evaluation x is an atomic load, and there is a condition branch c that may depend (intrathread dependence) on x and x-sb->c and c-sb->y.

This seems to allow reordering of relaxed atomics that processors do without extra fence instructions, allows most reorderings by the compiler, and gets rid of satisfaction cycles.

History
Date	User	Action	Args
2015-05-08 04:23:26	admin	set	messages: + msg7396
2015-04-04 16:45:17	admin	set	messages: + msg7330
2015-04-04 16:45:17	admin	set	status: new -> open
2013-06-17 00:00:00	admin	create