Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Tue, 31 Oct 2023 07:43:18 -0700

On Tue, Oct 31, 2023 at 02:16:34PM +0000, Michael Matz wrote:
> Hello,
> 
> On Tue, 31 Oct 2023, Peter Zijlstra wrote:
> 
> (I can't say anything about the WRITE_ONCE/rcu code, just about the below 
> codegen part)
> 
> > Welcome is not the right word. What bugs me most is that this was never
> > raised when this code was written :/
> > 
> > Mostly my problem is that GCC generates such utter shite when you
> > mention volatile. See, the below patch changes the perfectly fine and
> > non-broken:
> > 
> > 0148  1d8:      49 83 06 01             addq   $0x1,(%r14)
> 
> What is non-broken here that is ...
> 
> > into:
> > 
> > 0148  1d8:	49 8b 06             	mov    (%r14),%rax
> > 014b  1db:	48 83 c0 01          	add    $0x1,%rax
> > 014f  1df:	49 89 06             	mov    %rax,(%r14)
> 
> ... broken here?  (Sure code size and additional register use, but I don't 
> think you mean this with broken).
> 
> > For absolutely no reason :-(
> 
> The reason is simple (and should be obvious): to adhere to the abstract 
> machine regarding volatile.  When x is volatile then x++ consists of a 
> read and a write, in this order.  The easiest way to ensure this is to 
> actually generate a read and a write instruction.  Anything else is an 
> optimization, and for each such optimization you need to actively find an 
> argument why this optimization is correct to start with (and then if it's 
> an optimization at all).  In this case the argument needs to somehow 
> involve arguing that an rmw instruction on x86 is in fact completely 
> equivalent to the separate instructions, from read cycle to write cycle 
> over all pipeline stages, on all implementations of x86.  I.e. that a rmw 
> instruction is spec'ed to be equivalent.
> 
> You most probably can make that argument in this specific case, I'll give 
> you that.  But why bother to start with, in a piece of software that is 
> already fairly complex (the compiler)?  It's much easier to just not do 
> much anything with volatile accesses at all and be guaranteed correct.
> Even more so as the software author, when using volatile, most likely is 
> much more interested in correct code (even from a abstract machine 
> perspective) than micro optimizations.

I suspect that Peter cares because the code in question is on a fast
path within the scheduler.

> > At least clang doesn't do this, it stays:
> > 
> > 0403  413:      49 ff 45 00             incq   0x0(%r13)
> > 
> > irrespective of the volatile.
> 
> And, are you 100% sure that this is correct?  Even for x86 CPU 
> pipeline implementations that you aren't intimately knowing about? ;-)

Me, I am not even sure that the load-increment-store is always faithfully
executed by the hardware.  ;-)

> But all that seems to be a side-track anyway, what's your real worry with  
> the code sequence generated by GCC?

Again, my guess is that Peter cares deeply about the speed of the fast
path on which this increment occurs.

							Thanx, Paul