On Tue, Jan 21, 2025 at 8:43 AM Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> wrote:
>
> On 2025-01-21 16:59:40 [+0100], Vlastimil Babka wrote:
> > I don't think it would work, or am I missing something? The goal is to
> > allow the operation (alloc, free) to opportunistically succeed in e.g.
> > nmi context, but only if we didn't interrupt anything that holds the
> > lock. Otherwise we must allow for failure - hence trylock.
> > (a possible extension that I mentioned is to also stop doing irqsave to
> > avoid its overhead and thus also operations from an irq context would be
> > opportunistic)
> > But if we detect the "trylock must fail" cases only using lockdep, we'll
> > deadlock without lockdep. So e.g. the "active" flag has to be there?
>
> You are right. I noticed that myself but didn't get to reply…
>
> > So yes this goes beyond the original purpose of local_lock. Do you think
> > it should be a different lock type then, which would mean the other
> > users of current local_lock that don't want the opportunistic nesting
> > via trylock, would not inflict the "active" flag overhead?
> >
> > AFAICS the RT implementation of local_lock could then be shared for both
> > of these types, but I might be missing some nuance there.
>
> I was thinking about this over the weekend and this implementation
> extends the data structure by 4 bytes and has a mandatory read/write
> on every lock/unlock operation. This is what makes it a bit different
> than the original.
>
> If the local_lock_t is replaced with spinlock_t then the data structure
> is still extended by four bytes (assuming no lockdep) and we have a
> mandatory read/write operation. The whole thing still does not work on
> PREEMPT_RT but it isn't much different from what we have now. This is
> kind of my favorite. This could be further optimized to avoid the atomic
> operation given it is always local per-CPU memory. Maybe a
> local_spinlock_t :)

I think the concern of people with pitchforks is overblown.
These are the only users of local_lock_t:

drivers/block/zram/zcomp.h:	local_lock_t lock;
drivers/char/random.c:	local_lock_t lock;
drivers/connector/cn_proc.c:	local_lock_t lock;
drivers/md/raid5.h:	local_lock_t lock;
kernel/softirq.c:	local_lock_t lock;
mm/memcontrol.c:	local_lock_t stock_lock;
mm/mlock.c:	local_lock_t lock;
mm/slub.c:	local_lock_t lock;
mm/swap.c:	local_lock_t lock;
mm/swap.c:	local_lock_t lock_irq;
mm/zsmalloc.c:	local_lock_t lock;
net/bridge/br_netfilter_hooks.c:	local_lock_t bh_lock;
net/core/skbuff.c:	local_lock_t bh_lock;
net/ipv4/tcp_sigpool.c:	local_lock_t bh_lock;

Networking is the one that really cares about performance, and there
the 'int active' adds 4 bytes but no run-time overhead, since
networking uses local_lock_nested_bh(), which doesn't touch 'active'.

memcontrol.c and slub.c are the ones that need the new trylock logic
with the 'active' field to protect from NMI on !RT. They will change
to the new local_trylock_t anyway.

What's left is not perf critical. A single WRITE_ONCE() in lock and
another WRITE_ONCE() in unlock is really in the noise.

Hence I went with the simple approach you see in this patch.
I can go with a new local_trylock_t and the _Generic() trick.
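Either way, the caller side keeps the opportunistic pattern Vlastimil
describes above. Roughly (a sketch, not the actual patch:
consume_stock_fast() is a made-up name, and it assumes a
local_trylock_irqsave() wrapper around __local_trylock_irqsave() below
plus an unlock that clears 'active'):

	static bool consume_stock_fast(void)
	{
		unsigned long flags;

		/*
		 * May run in NMI context: proceed only if we did not
		 * interrupt a holder of this per-CPU lock.
		 */
		if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags))
			return false;	/* caller falls back to the slow path */

		/* ... fast path touching the per-CPU stock ... */

		local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
		return true;
	}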
The patch will look like this:

diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..ed4623e0c71a 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -15,6 +15,14 @@ typedef struct {
 #endif
 } local_lock_t;
 
+typedef struct {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+	struct task_struct	*owner;
+#endif
+	int active;
+} local_trylock_t;
+
 #define __local_lock_irqsave(lock, flags)			\
 	do {							\
+		local_lock_t *l;				\
 		local_irq_save(flags);				\
-		local_lock_acquire(this_cpu_ptr(lock));		\
+		l = (local_lock_t *)this_cpu_ptr(lock);		\
+		_Generic((lock),				\
+			local_trylock_t *:			\
+			({					\
+				lockdep_assert(((local_trylock_t *)l)->active == 0); \
+				WRITE_ONCE(((local_trylock_t *)l)->active, 1); \
+			}),					\
+			default: (void)0);			\
+		local_lock_acquire(l);				\
 	} while (0)
+
+#define __local_trylock_irqsave(lock, flags)			\
+	({							\
+		local_trylock_t *l;				\
+		local_irq_save(flags);				\
+		l = this_cpu_ptr(lock);				\
+		if (READ_ONCE(l->active) == 1) {		\
+			local_irq_restore(flags);		\
+			l = NULL;				\
+		} else {					\
+			WRITE_ONCE(l->active, 1);		\
+			local_trylock_acquire((local_lock_t *)l); \
+		}						\
+		!!l;						\
+	})

and use the new local_trylock_t where necessary. Like:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..8fe141e93a0b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1722,7 +1722,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 }
 
 struct memcg_stock_pcp {
-	local_lock_t stock_lock;
+	local_trylock_t stock_lock;

All lines where stock_lock is used will stay as-is.
So no code churn.

The above _Generic() isn't pretty, but not horrible either.
I guess that's a decent trade-off. Better ideas?
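For reference, the _Generic() above is plain C11 compile-time type
dispatch: the branch is picked from the static type of the macro
argument, so local_lock_t users compile to exactly the code they have
today and only local_trylock_t users get the 'active' accesses.
A minimal userspace sketch of the mechanism (illustrative only,
compiles standalone):

#include <stdio.h>

typedef struct { int dummy; } local_lock_t;
typedef struct { int active; } local_trylock_t;

/* The branch is chosen at compile time from the pointer type. */
#define lock_kind(lock)						\
	_Generic((lock),					\
		local_trylock_t *: "trylock flavor (touches ->active)", \
		local_lock_t *: "plain flavor (no ->active)")

int main(void)
{
	local_lock_t l;
	local_trylock_t tl;

	puts(lock_kind(&l));	/* plain flavor (no ->active) */
	puts(lock_kind(&tl));	/* trylock flavor (touches ->active) */
	return 0;
}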