On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > But if the reader does see BLOCK it will not be an active reader any
> > more; and thus the writer doesn't need to observe and wait for it.
>
> I meant they both can block, but please ignore. Today I simply can't
> understand what I was thinking about yesterday.

I think we all know that state all too well ;-)

> I tried hard to find any hole in this version but failed, I believe it
> is correct.

Yay!

> But, could you help me to understand some details?

I'll try, but I'm not too bright atm myself :-)

> > +void __get_online_cpus(void)
> > +{
> > +again:
> > +	/* See __srcu_read_lock() */
> > +	__this_cpu_inc(__cpuhp_refcount);
> > +	smp_mb(); /* A matches B, E */
> > +	__this_cpu_inc(cpuhp_seq);
> > +
> > +	if (unlikely(__cpuhp_state == readers_block)) {
>
> Note that there is no barrier() after inc(seq) and the __cpuhp_state
> check; this inc() can be "postponed" till ...
>
> > +void __put_online_cpus(void)
> >  {
> > +	/* See __srcu_read_unlock() */
> > +	smp_mb(); /* C matches D */
>
> ... this mb() in __put_online_cpus().
>
> And this is fine! The question is, perhaps it would be more "natural"
> and understandable to shift this_cpu_inc(cpuhp_seq) into
> __put_online_cpus().

Possibly; I never got further than establishing that the required order
is:

	ref++
	MB
	seq++

	MB
	ref--

It doesn't matter if the seq++ is in the lock or the unlock primitive;
I never considered one place more natural than the other.

> We need to ensure 2 things:
>
> 1. The reader should notice state = BLOCK or the writer should see
>    inc(__cpuhp_refcount). This is guaranteed by the 2 mb's in
>    __get_online_cpus() and in cpu_hotplug_begin().
>
>    We do not care if the writer misses some inc(__cpuhp_refcount)
>    in per_cpu_sum(__cpuhp_refcount); that reader(s) should notice
>    state = readers_block (and inc(cpuhp_seq) can't help anyway).

Agreed.

> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
>    from __put_online_cpus() (note that the writer can miss the
>    corresponding inc() if it was done on another CPU, so this dec()
>    can lead to sum() == 0), it should also notice the change in
>    cpuhp_seq.
>
>    Fortunately, this can only happen if the reader migrates; in
>    this case schedule() provides a barrier and the writer can't miss
>    the change in cpuhp_seq.

Again, agreed; this is also the message of the second comment in
cpuhp_readers_active_check() by Paul.

> IOW. Unless I missed something, cpuhp_seq is actually needed to
> serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
> the /* D matches C */ in cpuhp_readers_active_check(), and this
> is not immediately clear if you look at __get_online_cpus().
>
> I do not suggest to change this code, but please tell me if my
> understanding is not correct.

I think you're entirely right.
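For completeness, a rough userspace model of that serialization
argument; this is not the kernel code, just a sketch with invented
names, where plain arrays stand in for the per-cpu counters and C11
seq_cst fences stand in for smp_mb():

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS	4		/* stand-in for the per-cpu-ness */

static _Atomic unsigned int refcount[NR_CPUS];	/* ~ __cpuhp_refcount */
static _Atomic unsigned int seqcount[NR_CPUS];	/* ~ cpuhp_seq */

static unsigned int per_cpu_sum(_Atomic unsigned int *v)
{
	unsigned int sum = 0;

	for (int i = 0; i < NR_CPUS; i++)
		sum += atomic_load_explicit(&v[i], memory_order_relaxed);
	return sum;
}

static void reader_lock(int cpu)	/* ~ __get_online_cpus() fast path */
{
	atomic_fetch_add_explicit(&refcount[cpu], 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* A matches B */
	atomic_fetch_add_explicit(&seqcount[cpu], 1, memory_order_relaxed);
}

static void reader_unlock(int cpu)	/* ~ __put_online_cpus() */
{
	atomic_thread_fence(memory_order_seq_cst);	/* C matches D */
	atomic_fetch_sub_explicit(&refcount[cpu], 1, memory_order_relaxed);
}

/*
 * ~ cpuhp_readers_active_check(): true if no reader can hold a
 * reference. The seqcount re-check is what catches your point 2: if a
 * migrated reader's dec() made the refcount sum 0 while its inc() was
 * missed, its seq increment must be visible by the second sum.
 */
static bool readers_gone(void)
{
	unsigned int seq = per_cpu_sum(seqcount);

	atomic_thread_fence(memory_order_seq_cst);	/* B matches A */

	if (per_cpu_sum(refcount) != 0)
		return false;

	atomic_thread_fence(memory_order_seq_cst);	/* D matches C */

	return per_cpu_sum(seqcount) == seq;
}

int main(void)
{
	reader_lock(0);
	bool active = !readers_gone();	/* writer must see the reader */
	reader_unlock(0);
	return (active && readers_gone()) ? 0 : 1;
}

Obviously the real thing uses preempt-safe per-cpu ops and has the
BLOCK slow path; the model only shows why the seq re-check makes a
zero refcount sum trustworthy.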
> > +static bool cpuhp_readers_active_check(void)
> >  {
> > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > +
> > +	smp_mb(); /* B matches A */
> > +
> > +	/*
> > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > +	 */
> >
> > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > +		return false;
> >
> > +	smp_mb(); /* D matches C */
>
> It seems that both barriers could be smp_rmb()? I am not sure the
> comments from srcu_readers_active_idx_check() can explain mb(); note
> that __srcu_read_lock() always succeeds unlike get_online_cpus().

I see what you mean; cpuhp_readers_active_check() is all reads; there
are no writes to order.

Paul, is there any argument for the MB here as opposed to RMB; and if
not, should we change both these and SRCU?

> > void cpu_hotplug_done(void)
> > {
> > +	/* Signal the writer is done, no fast path yet. */
> > +	__cpuhp_state = readers_slow;
> > +	wake_up_all(&cpuhp_readers);
> > +
> > +	/*
> > +	 * The wait_event()/wake_up_all() prevents the race where the
> > +	 * readers are delayed between fetching __cpuhp_state and blocking.
> > +	 */
> > +
> > +	/* See percpu_up_write(); readers will no longer attempt to block. */
> > +	synchronize_sched();
> > +
> > +	/* Let 'em rip */
> > +	__cpuhp_state = readers_fast;
> > +	current->cpuhp_ref--;
> > +
> > +	/*
> > +	 * Wait for any pending readers to be running. This ensures readers
> > +	 * after writer and avoids writers starving readers.
> > +	 */
> > +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > }
>
> OK, to some degree I can understand the "avoids writers starving
> readers" part (although the next writer should do synchronize_sched()
> first), but could you explain "ensures readers after writer"?

Suppose reader A sees state == BLOCK and goes to sleep; our writer B
does cpu_hotplug_done() and wakes all pending readers. If for some
reason A doesn't get scheduled to inc its ref until B again executes
cpu_hotplug_begin() and state is once again BLOCK, A will not have made
any progress.

The waitcount increment before __put_online_cpus() ensures
cpu_hotplug_done() sees the !0 waitcount and waits until our reader has
run far enough to at least pass the dec_and_test(). And once past the
dec_and_test() preemption is disabled, so the synchronize_sched() in a
new cpu_hotplug_begin() will suffice to guarantee we'll have acquired a
reference and are an active reader.
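In model form again (same caveats: invented names, no real sleeping;
spin loops stand in for wait_event()/wake_up_all(), and the ref/seq
manipulation from the earlier sketch is elided):

#include <stdatomic.h>

enum { readers_fast, readers_slow, readers_block };	/* ~ __cpuhp_state */

static _Atomic int state = readers_fast;
static _Atomic int waitcount;				/* ~ cpuhp_waitcount */

/* Reader slow path, entered after it saw state == readers_block. */
static void reader_slow_path(void)
{
	/*
	 * Bump waitcount _before_ dropping the speculative reference;
	 * from here on writer_done() below cannot complete until we
	 * have passed the final decrement, i.e. until we have had a
	 * chance to re-acquire our reference.
	 */
	atomic_fetch_add(&waitcount, 1);
	/* ~ __put_online_cpus(): drop the speculative reference */
	while (atomic_load(&state) == readers_block)
		;		/* ~ wait_event(cpuhp_readers, ...) */
	/* here the real code disables preemption */
	atomic_fetch_sub(&waitcount, 1);	/* ~ dec_and_test() + wake */
	/* ~ goto again: re-take the reference, non-preemptibly */
}

/*
 * Tail of cpu_hotplug_done(): after re-enabling the fast path, wait
 * for all woken readers to get going, so the next cpu_hotplug_begin()
 * cannot re-block them before they have made any progress.
 */
static void writer_done(void)
{
	atomic_store(&state, readers_fast);
	while (atomic_load(&waitcount) != 0)
		;	/* ~ wait_event(cpuhp_writer, !atomic_read(...)) */
}

int main(void)
{
	reader_slow_path();	/* state is readers_fast: falls through */
	writer_done();		/* waitcount drained: returns at once */
	return 0;
}

The writer's second spin is the "ensures readers after writer" part:
it is what stops the scenario above where A never gets to run between
two BLOCK windows.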