On 09/27, Paul E. McKenney wrote: > > On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote: > > > > Yes, yes, I see now. We really need this barriers, except I think > > rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question, > > > > // rcu_try_flip_waitzero() > > if (A == 0) { > > mb(); > > B == 0; > > } > > > > Do we really need the mb() in this case? How it is possible that STORE > > goes before LOAD? "Obviously", the LOAD should be completed first, no? > > Suppose that A was most recently stored by a CPU that shares a store > buffer with this CPU. Then it is possible that some other CPU sees > the store to B as happening before the store that "A==0" above is > loading from. This other CPU would naturally conclude that the store > to B must have happened before the load from A. Ah, I was confused by the comment, smp_mb(); /* Don't call for memory barriers before we see zero. */ ^^^^^^^^^^^^^^^^^^ So, in fact, we need this barrier to make sure that _other_ CPUs see these changes in order, thanks. Of course, _we_ already saw zero. But in that particular case this doesn't matter, rcu_try_flip_waitzero() is the only function which reads the "non-local" per_cpu(rcu_flipctr), so it doesn't really need the barrier? (besides, it is always called under fliplock). > In more detail, suppose that CPU 0 and 1 share a store buffer, and > that CPU 2 and 3 share a second store buffer. This happens naturally > if CPUs 0 and 1 are really just different hardware threads within a > single core. > > So, suppose the cacheline for A is initially owned by CPUs 2 and 3, > and that the cacheline for B is initially owned by CPUs 0 and 1. > Then consider the following sequence of events: > > o CPU 0 stores zero to A. This is a cache miss, so the new value > for A is placed in CPU 0's and 1's store buffer. > > o CPU 1 executes the above code, first loading A. It sees > the value of A==0 in the store buffer, and therefore > stores zero to B, which hits in the cache. (I am assuming > that we left out the mb() above). > > o CPU 2 loads from B, which misses the cache, and gets the > value that CPU 1 stored. Suppose it checks the value, > and based on this check, loads A. The old value of A might > still be in cache, which would lead CPU 2 to conclude that > the store to B by CPU 1 must have happened before the store > to A by CPU 0. > > Memory barriers would prevent this confusion. Thanks a lot!!! This fills another gap in my understanding. OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a store buffer, the situation is simple, we can replace "CPU 0 stores" with "CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2? Suppose that CPU 1 does wmb(); B = 0 Can we assume that CPU 2 doing if (B == 0) { rmb(); must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ? > > > > If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/" > > > > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ? > > > > > > I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen > > > transition has to be synchronized to the moving of the callbacks -- > > > either that or we need more GP_STAGES. > > > > Hmm. Still can't understand. > > Callbacks would be able to be injected into a grace period after it > started. Yes, but this is _exactly_ what the current code does in the scenario below, > Or are you arguing that as long as interrupts remain disabled between > the two events, no harm done? no, > > Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees > > rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped > > (say, another CPU does rcu_try_flip_idle() in between). > > > > We ack the flip, call_rcu() enables irqs, the timer interrupt calls > > __rcu_advance_callbacks() again and moves the callbacks. > > > > So, it is still possible that "move callbacks" and "ack the flip" happen > > out of order. But why this is bad? Look, what happens is // call_rcu() rcu_flip_flag = rcu_flipped insert the new callback // timer irq move the callbacks (the new one goes to wait[0]) But I still can't understand why this is bad, > > This can't "speedup" the moving of our callbacks from next to done lists. > > Yes, RCU state machine can switch to the next state/stage, but this looks > > safe to me. Before this callback will be flushed, we need 2 rdp->completed != rcu_ctrlblk.completed further events, we can't miss rcu_read_lock() section, no? > > Help! Please :) > > if (rcu_ctrlblk.completed == rdp->completed) > > rcu_try_flip(); > > > > Could you clarify the check above? Afaics this is just optimization, > > technically it is correct to rcu_try_flip() at any time, even if ->completed > > are not synchronized. Most probably in that case rcu_try_flip_waitack() will > > fail, but nothing bad can happen, yes? > > >From a conceptual viewpoint, if this CPU hasn't caught up with the > last grace-period stage, it has no business trying to push forward to > the next stage. So this might (or might not) happen to work with this > particular implementation, it needs to stay as is. We need this code > to be robust enough to optimize the grace-period latencies, right? Yes, yes. I just wanted to be sure I didn't miss some other subtle reason. > > void synchronize_sched(void) > > { > > struct migration_req req; > > > > req->task = NULL; > > init_completion(&req.done); > > > > for_each_online_cpu(cpu) { > > struct rq *rq = cpu_rq(cpu); > > int online; > > > > spin_lock_irq(&rq->lock); > > online = cpu_online(cpu); // HOTPLUG_CPU > > if (online) { > > list_add(&req->list, &rq->migration_queue); > > req.done.done = 0; > > } > > spin_unlock_irq(&rq->lock); > > > > if (online) { > > wake_up_process(rq->migration_thread); > > wait_for_completion(&req.done); > > } > > } > > } > > > > I need to think about your approach above. It looks like you are > leveraging the migration tasks, but I am concerned about concurrent > hotplug events. I hope this is OK, note that migration_call(CPU_DEAD) flushes ->migration_queue, if we take rq->lock after that we must see !cpu_online(cpu). CPU_UP event is not interesting for us, we can miss it. Hmm... but wake_up_process() should be moved under spin_lock(). Oleg. - To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html