On 2018/09/17 0:08, Paul E. McKenney wrote: > On Sun, Sep 16, 2018 at 05:37:23PM +0900, Akira Yokosawa wrote: >> Hi Paul, >> >> Every time I review code under CodeSamples/, >> I find myself confused where to use READ_ONCE/WRITE_ONCEs. >> >> I'm looking at Listing 5.3 of current master. >> >> There are two cases which lack READ_ONCE/WRITE_ONCE to access potentially >> shared variables, namely on line 5 (__get_thread_var(counter)++;) and >> on line 14 (sum += per_thread(counter, t);). >> >> Line 5 looks like a good candidate to be optimized out when inlined. >> But the performance result indicates "gcc -O3" keeps it inside the loop. >> >> Is this because the definition of __get_thread_var() contains >> a call to smp_thread_id() and complicated enough not to be optimized >> out? >> >> As for line 14, as per_thread() was derived from per_cpu() of kernel >> API, I looked for call sites of per_cpu() in the kernel source tree. >> >> There are very few cases where READ_ONCE/WRITE_ONCE is used along >> with per_cpu(). There are two READ_ONCEs with per_cpu() in >> kernel/rcu/srcutree.c, whose author is none other than you. >> Are those READ_ONCEs necessary? >> >> I don't grasp the actual definition of per_cpu() macro. >> Definition of per_thread() macro under CodeSamples/api-pthreads/ >> does not look so complicated, but contains array indexing, >> which might be good enough to prevent optimization in the loop. >> >> I'm not sure, but my gut feeling is that READ_ONCE/WRITE_ONCE >> is necessary to access an unannotated variable. If we need >> volatility for sure, we could modify the definition of annotating >> macros/functions. >> >> Can you enlighten me? > > Yes, I do need to expand on this, both in perfbook and in order to inspect > my usage in Linux-kernel RCU. Please see below for my current outline, > and any and all feedback would be deeply appreciated! > So, I applied your outline to accesses in Listing 5.3. See inline comments below for my observation. > Thanx, Paul > > ------------------------------------------------------------------------ > > > Plain accesses, that is, without either READ_ONCE() or WRITE_ONCE(): > o Only accessed under lock. > o Only accessed by owning CPU/thread. > o Only modified by owning CPU/thread under lock, and otherwise > accessed only either while holding lock or by owning CPU/thread. > > WRITE_ONCE() for modifications, READ_ONCE() for non-owning CPUs/threads > not holding lock: > o Only modified while holding lock by owning CPU/thread, > could be read from anywhere by anyone. > > WRITE_ONCE() for modifications, READ_ONCE() for CPUs/threads not holding > lock: > o Only modified while holding lock, could be read anywhere by > anyone. > > WRITE_ONCE() for modifications, READ_ONCE() for non-owning > CPUs/threads: > o Only modified by owning CPU/threads, could be read anywhere > by anyone. Listing 5.3's per-thread variable "counter" looks covered here. So the write part of the increment on line 5 needs WRITE_ONCE(). Read access on line 14 also needs READ_ONCE(). Read part of the increment doesn't need READ_ONCE(), I guess. Line 5 would look like: WRITE_ONCE(__get_thread_var(counter), __get_thread_var(counter) + 1); . Considering the cost of __get_thread_var(), there should be a room to help compilers to optimize by using a local pointer as follows: static __inline__ void inc_count(void) { long *p_counter = &__get_thread_var(counter); WRITE_ONCE(*p_counter, *p_counter + 1); } . Does this look reasonable to you? Or is it better to also use READ_ONCE()? But if __get_thread_var() does imply volatility, we can keep the current way of increment: __get_thread_var(counter)++; Thoughts? Thanks, Akira > > Otherwise, READ_ONCE() and WRITE_ONCE() everywhere. > > But note that READ_ONCE() and WRITE_ONCE() provide absolutely no > ordering guarantees unless: > > o There is only one variable being accesses, and all of those > accesses are of the same size and alignment. > > o There is only one CPU/thread doing the accesses to all > of the variables during the timeframe of interest. > > o There are other operations involved, for example, smp_mb(). >