Re: [PATCH bpf-next v2 3/6] locking/local_lock: Introduce local_trylock_irqsave()

On 12/12/24 03:49, Alexei Starovoitov wrote:
> On Wed, Dec 11, 2024 at 3:55 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>>
>> On 12/11/24 11:53, Vlastimil Babka wrote:
>> > On 12/10/24 03:39, Alexei Starovoitov wrote:
>> >> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>> >>
>> >> Similar to local_lock_irqsave() introduce local_trylock_irqsave().
>> >> It uses spin_trylock in PREEMPT_RT and always succeeds when !RT.
>> >
>> > Hmm, but is it correct to always succeed? If we're in an nmi, we might be
>> > preempting an existing local_(try)lock_irqsave() critical section, because
>> > disabling irqs doesn't stop NMIs, right?
>>
>> So unless I'm missing something, it would need to be a new kind of local
>> lock to support this trylock operation on !RT?
> 
> Ohh. Correct. I forgot about an nmi interrupting a local_lock_irqsave() region in !RT.
> 
>> Perhaps based on the same
>> principle of a simple active/locked flag that I tried in my sheaves RFC? [1]
>> There could also be the advantage that if all (potentially) irq contexts
>> (not just nmi) used trylock, it would be sufficient to disable preemption
>> and not interrupts, which is cheaper.
> 
> I don't think that's the case.
> pushf was slow on old x86.
> According to https://www.agner.org/optimize/instruction_tables.pdf
> it's 3 uops on Skylake.
> That could be faster than preempt_disable() (incl %gs:addr),
> which is 3-4 uops assuming a cache hit.

I think the costly ones are not pushf, but cli and popf. I did some
microbenchmarking in the kernel (on a Ryzen 2700, so not exactly the latest
thing, but anyway) and IIRC irqsave/restore was about twice as slow as
preempt disable/enable alone.
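
Roughly this kind of thing (not the exact code I ran back then; the module
boilerplate, loop count and ktime-based timing below are just an
illustrative sketch):

#include <linux/module.h>
#include <linux/ktime.h>
#include <linux/irqflags.h>
#include <linux/preempt.h>

static int __init irqcost_init(void)
{
	unsigned long flags;
	u64 t0, t1, t2;
	int i;

	t0 = ktime_get_ns();
	for (i = 0; i < 1000000; i++) {
		local_irq_save(flags);		/* pushf; cli */
		local_irq_restore(flags);	/* popf */
	}
	t1 = ktime_get_ns();
	for (i = 0; i < 1000000; i++) {
		preempt_disable();		/* incl %gs:__preempt_count */
		preempt_enable();		/* decl + resched check */
	}
	t2 = ktime_get_ns();

	pr_info("irqsave/restore: %llu ns, preempt disable/enable: %llu ns\n",
		t1 - t0, t2 - t1);
	return 0;
}

static void __exit irqcost_exit(void) { }

module_init(irqcost_init);
module_exit(irqcost_exit);
MODULE_LICENSE("GPL");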

>> The RT variant could work as you proposed here; that part was wrong in my RFC,
>> as you already pointed out when we discussed v1 of this series.
>>
>> [1]
>> https://lore.kernel.org/all/20241112-slub-percpu-caches-v1-5-ddc0bdc27e05@xxxxxxx/
> 
> I like your
> +struct local_tryirq_lock
> approach, but let's put it in local_lock.h?

Sure, that was a proof of concept, so I kept it local.

> and it probably needs local_inc_return() instead of READ/WRITE_ONCE.
> With irqs and nmis it's racy.

Hm, I guess you are right, thanks!
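
Something like this then, I suppose (completely untested; the
local_tryirq_* names and details are just for illustration, the point is
the local_t + local_inc_return() usage):

#include <linux/types.h>
#include <linux/preempt.h>
#include <linux/percpu-defs.h>
#include <asm/local.h>

struct local_tryirq_lock {
	local_t active;
};

/*
 * Succeeds when nobody on this cpu holds the lock; fails when we
 * interrupted (irq/nmi) a critical section that already took it.
 */
#define local_tryirq_trylock(lock)					\
({									\
	struct local_tryirq_lock *__l;					\
	bool __locked = true;						\
									\
	preempt_disable();						\
	__l = this_cpu_ptr(lock);					\
	if (local_inc_return(&__l->active) != 1) {			\
		/* roll back, an outer context owns it */		\
		local_dec(&__l->active);				\
		preempt_enable();					\
		__locked = false;					\
	}								\
	__locked;							\
})

#define local_tryirq_unlock(lock)					\
do {									\
	local_dec(&this_cpu_ptr(lock)->active);				\
	preempt_enable();						\
} while (0)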

> In the meantime I think I will fix it as below:
> 
>> >> +#define __local_trylock_irqsave(lock, flags)                        \
>> >> +    ({                                                      \
>> >> +            local_irq_save(flags);                          \
>> >> +            local_trylock_acquire(this_cpu_ptr(lock));      \
>> >> +            1;                                              \
>> >> +    })
> 
> as
> #define __local_trylock_irqsave(lock, flags)                    \
>         ({                                                      \
>                 local_irq_save(flags);                          \
>                 local_trylock_acquire(this_cpu_ptr(lock));      \
>                 !in_nmi();                                      \
>         })
> 
> I think that's good enough for memcg patch 4 and
> doesn't grow local_lock_t on !RT.

But that means you'll never succeed in an nmi; doesn't that limit the bpf use case?

> We can introduce
> 
> typedef struct {
>         int count;
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         struct lockdep_map      dep_map;
>         struct task_struct      *owner;
> #endif
> } local_trylock_t;
> 
> and the whole set of local_trylock_lock, local_trylock_unlock,...
> But that's quite a bit of code. Feels a bit overkill for memcg patch 4.

SLUB also uses local_locks, so it would be needed there later too.
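
FWIW, I'd imagine the !RT acquire/release for such a local_trylock_t
roughly as below (untested sketch, naming is ad hoc, and it leaves aside
lockdep/owner and the READ/WRITE_ONCE vs local_inc_return() question):

#define __local_trylock_irqsave(lock, flags)				\
({									\
	local_trylock_t *__lt;						\
	bool __locked = false;						\
									\
	local_irq_save(flags);						\
	__lt = this_cpu_ptr(lock);					\
	/* non-zero count means we interrupted a holder (e.g. in nmi) */ \
	if (!READ_ONCE(__lt->count)) {					\
		WRITE_ONCE(__lt->count, 1);				\
		__locked = true;					\
	} else {							\
		local_irq_restore(flags);				\
	}								\
	__locked;							\
})

#define __local_trylock_unlock_irqrestore(lock, flags)			\
do {									\
	WRITE_ONCE(this_cpu_ptr(lock)->count, 0);			\
	local_irq_restore(flags);					\
} while (0)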

> At this point it feels like adding 'int count' to the existing local_lock_t
> is cleaner.

We have Peter and Thomas in Cc, so let's see what they think :)



