Re: [PATCH] rcu: use try_cmpxchg in check_cpu_stall

Uros Bizjak <ubizjak@xxxxxxxxx> · Wed, 1 Mar 2023 11:28:47 +0100

On Wed, Mar 1, 2023 at 1:08 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Tue, 28 Feb 2023 18:30:14 -0500
> Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > >
> > > But looking at this use case, I'd actually NAK it, as it is misleading.
> >
> > I'm trying to parse this. You are saying it is misleading, because it
> > updates js when it doesn't need to?
>
> Correct.

I'm a bit late to the discussion (well, I have to sleep from time to
time, too), but in the hope that everybody interested in this issue
will find the reply, I'll try to clarify the "updates" claim:

try_cmpxchg is written in a way that benefits loops as well as linear
code. In the latter case, it relies on the compiler to eliminate the
dead assignment to the "old" variable.
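
To make the "updates" semantics concrete, here is a minimal userspace
sketch, not the kernel implementation. It assumes the GCC/Clang builtin
__sync_val_compare_and_swap as a stand-in for cmpxchg, and all sketch_*
names are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical stand-in for cmpxchg(): returns the value found at *ptr. */
static unsigned long sketch_cmpxchg(unsigned long *ptr, unsigned long old,
				    unsigned long new_val)
{
	return __sync_val_compare_and_swap(ptr, old, new_val);
}

/* try_cmpxchg()-style wrapper: returns success, refreshes *oldp on failure. */
static bool sketch_try_cmpxchg(unsigned long *ptr, unsigned long *oldp,
			       unsigned long new_val)
{
	unsigned long old = *oldp;
	unsigned long cur = sketch_cmpxchg(ptr, old, new_val);

	if (cur != old)
		*oldp = cur;	/* dead store if the caller never reads *oldp again */
	return cur == old;
}

/* Loop use: the refreshed value feeds the next iteration without a reload. */
static unsigned long sketch_set_bits(unsigned long *ptr, unsigned long bits)
{
	unsigned long old = *ptr;	/* plain read, good enough for a sketch */

	while (!sketch_try_cmpxchg(ptr, &old, old | bits))
		;
	return old;	/* value seen just before the successful exchange */
}
```

In a loop such as sketch_set_bits() the write to *oldp is what saves the
reload. In linear code the same write is only removed if the compiler can
prove that nothing reads the variable afterwards.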

When changing linear code from cmpxchg to try_cmpxchg, one has to take
care that the variable passed by reference is unused after the call,
so that it can be treated as a temporary. As said elsewhere, the
alternative is to copy the value to a local temporary variable and
pass a pointer to that variable to try_cmpxchg. The compiler will
eliminate the assignment if the original variable is unused.
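
The local-temporary alternative could look roughly like the following
userspace sketch. It is only an illustration under stated assumptions:
it uses the GCC/Clang builtins __sync_val_compare_and_swap and
__atomic_compare_exchange_n instead of the kernel primitives, and
sketch_deadline and both claim_* helpers are hypothetical names, not
code from the patch:

```c
#include <stdbool.h>

/* Hypothetical shared word that two contexts race to update. */
static unsigned long sketch_deadline;

/* cmpxchg() style: the caller compares the returned value itself. */
static bool claim_with_cmpxchg(unsigned long js, unsigned long jn)
{
	return __sync_val_compare_and_swap(&sketch_deadline, js, jn) == js;
}

/* try_cmpxchg() style: success comes back directly as a boolean. */
static bool claim_with_try_cmpxchg(unsigned long js, unsigned long jn)
{
	unsigned long tmp = js;	/* temporary, so js itself is never written */

	/* On failure tmp receives the current value; it is unused afterwards. */
	return __atomic_compare_exchange_n(&sketch_deadline, &tmp, jn, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```

Both helpers give the same answer for the same inputs. The second one
keeps js untouched, so a reader never has to wonder whether the updated
value is used later.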

Even in linear code, the conversion from cmpxchg to try_cmpxchg
eliminates an assignment and a compare (the mov and the cmp). When
compiled with gcc-10.3.1, the code changes from:

    a1c5:    0f 84 53 03 00 00        je     a51e <rcu_sched_clock_irq+0x70e>
    a1cb:    48 89 c8                 mov    %rcx,%rax
    a1ce:    f0 48 0f b1 35 00 00     lock cmpxchg %rsi,0x0(%rip)        # a1d7 <rcu_sched_clock_irq+0x3c7>
    a1d5:    00 00
            a1d3: R_X86_64_PC32    .data+0xf9c
    a1d7:    48 39 c1                 cmp    %rax,%rcx
    a1da:    0f 85 3e 03 00 00        jne    a51e <rcu_sched_clock_irq+0x70e>

to:

    a1d0:    0f 84 49 03 00 00        je     a51f <rcu_sched_clock_irq+0x70f>
    a1d6:    f0 48 0f b1 35 00 00     lock cmpxchg %rsi,0x0(%rip)        # a1df <rcu_sched_clock_irq+0x3cf>
    a1dd:    00 00
            a1db: R_X86_64_PC32    .data+0xf9c
    a1df:    0f 85 3a 03 00 00        jne    a51f <rcu_sched_clock_irq+0x70f>

Newer compilers (e.g. gcc-12+) are able to use likely/unlikely
annotations to reorder the code, so the change is less visible. But
due to the reordering, even targets that don't define try_cmpxchg
natively benefit from the change; please see the thread at [1].

These benefits are the reason the change to try_cmpxchg was also
accepted for linear code elsewhere in the Linux kernel, e.g. in
commits [2,3], to name a few, with a thumbs-up and a claim that the
new code is actually *clearer* in the merge commit [4].

I really think that the above demonstrates various improvements, and
it would be unfortunate not to consider them.

[1] https://lore.kernel.org/lkml/871qwgmqws.fsf@xxxxxxxxxxxxxxxxxx/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4e1da8fe031303599e78f88e0dad9f44272e4f99
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8baceabca656d5ef4494cdeb3b9b9fbb844ac613
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91bc559d8d3aed488b4b50e9eba1d7ebb1da7bbf

Uros.

> >
> > > try_cmpxchg() is used to get rid of the updating of the old value, as in
> > > the ring buffer code, where we had:
> > >
> > > void ring_buffer_record_off(struct trace_buffer *buffer)
> > > {
> > >         unsigned int rd;
> > >         unsigned int new_rd;
> > >
> > >         do {
> > >                 rd = atomic_read(&buffer->record_disabled);
> > >                 new_rd = rd | RB_BUFFER_OFF;
> > >         } while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
> >
> > Here you actually meant "rd" as the second parameter without the & ?
>
> Yes, I cut and pasted the updated code and incorrectly tried to revert it in
> this example :-p
>
> >
> > > }
> > >
> > > and the try_cmpxchg() converted it to:
> > >
> > > void ring_buffer_record_off(struct trace_buffer *buffer)
> > > {
> > >         unsigned int rd;
> > >         unsigned int new_rd;
> > >
> > >         rd = atomic_read(&buffer->record_disabled);
> > >         do {
> > >                 new_rd = rd | RB_BUFFER_OFF;
> > >         } while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> > > }
> > >
> > > Which got rid of the need to constantly update the rd variable (cmpxchg
> > > will load rax with the value read, so it removes the need for an extra
> > > move).
> >
> > So that's a good thing?
>
> Yes. For looping, try_cmpxchg() is the proper function to use. But in the
> RCU case (and other cases in the ring-buffer patch) there is no loop, and
> no need to modify the "old" variable.
>
> >
> > >
> > > But in your case, we don't need to update js, yet try_cmpxchg()
> > > does update it.
> >
> > Right, it has lesser value here but I'm curious why you feel it also
> > doesn't belong in that ring buffer loop you shared (or did you mean,
> > it does belong there but not in other ftrace code modified by Uros?).
>
> The ring buffer patch had more than one change, where half the updates were
> fine, and half were not.
>
> -- Steve