Hi Paul,

Every time I review code under CodeSamples/, I find myself unsure where READ_ONCE/WRITE_ONCE should be used.

I'm looking at Listing 5.3 in the current master. There are two accesses to potentially shared variables that lack READ_ONCE/WRITE_ONCE, namely on line 5 (__get_thread_var(counter)++;) and on line 14 (sum += per_thread(counter, t);).

Line 5 looks like a good candidate for being optimized out of the caller's loop once it is inlined. But the performance results indicate that "gcc -O3" keeps it inside the loop. Is this because the definition of __get_thread_var() contains a call to smp_thread_id() and is complicated enough to prevent that optimization?

As for line 14, since per_thread() was derived from the kernel's per_cpu() API, I looked for call sites of per_cpu() in the kernel source tree. Very few of them use READ_ONCE/WRITE_ONCE together with per_cpu(). There are two READ_ONCEs with per_cpu() in kernel/rcu/srcutree.c, whose author is none other than you. Are those READ_ONCEs necessary? I don't fully grasp the actual definition of the per_cpu() macro. The definition of the per_thread() macro under CodeSamples/api-pthreads/ does not look so complicated, but it contains array indexing, which might be enough to prevent optimization in the loop.

I'm not sure, but my gut feeling is that READ_ONCE/WRITE_ONCE is necessary when accessing an unannotated variable. If we do need volatility, we could modify the definitions of the annotating macros/functions.

Can you enlighten me?

Thanks, Akira
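
P.S. To make the question concrete, here is a minimal sketch of what I think the annotated accesses would look like. It is not the code from Listing 5.3: NR_THREADS, the plain counter[] array, and the explicit thread index t are stand-ins I made up in place of __get_thread_var(), per_thread(), and smp_thread_id(), and the READ_ONCE/WRITE_ONCE definitions are simplified volatile casts (GCC/Clang __typeof__), not the ones in CodeSamples/ or the kernel.

```c
#define NR_THREADS 4	/* hypothetical thread count for this sketch */

/* Simplified stand-ins: force exactly one access via a volatile cast. */
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

/* Stand-in for the per-thread variable: one slot per thread. */
static unsigned long counter[NR_THREADS];

/* Update side: only thread t ever stores to counter[t]. */
static inline void inc_count(int t)
{
	/*
	 * The load of our own slot can stay plain because no other
	 * thread writes it; the store is marked because readers on
	 * other threads may load it concurrently.
	 */
	WRITE_ONCE(counter[t], counter[t] + 1);
}

/* Read side: sums every thread's slot, racing with the updaters. */
static inline unsigned long read_count(void)
{
	unsigned long sum = 0;

	for (int t = 0; t < NR_THREADS; t++)
		sum += READ_ONCE(counter[t]);
	return sum;
}
```

With these markings the compiler may not fuse, hoist, or tear the accesses to counter[t], which is the behavior I am unsure we can rely on from the unannotated versions on lines 5 and 14.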