On Tue, Jan 21, 2025 at 8:43 AM Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> wrote:
>
> On 2025-01-21 16:59:40 [+0100], Vlastimil Babka wrote:
> > I don't think it would work, or am I missing something? The goal is to
> > allow the operation (alloc, free) to opportunistically succeed in e.g.
> > nmi context, but only if we didn't interrupt anything that holds the
> > lock. Otherwise we must allow for failure - hence trylock.
> > (a possible extension that I mentioned is to also stop doing irqsave to
> > avoid its overhead and thus also operations from an irq context would be
> > opportunistic)
> > But if we detect the "trylock must fail" cases only using lockdep, we'll
> > deadlock without lockdep. So e.g. the "active" flag has to be there?
>
> You are right. I noticed that myself but didn't get to reply…
>
> > So yes this goes beyond the original purpose of local_lock. Do you think
> > it should be a different lock type then, which would mean the other
> > users of current local_lock that don't want the opportunistic nesting
> > via trylock, would not inflict the "active" flag overhead?
> >
> > AFAICS the RT implementation of local_lock could then be shared for both
> > of these types, but I might be missing some nuance there.
>
> I was thinking about this over the weekend and this implementation
> extends the data structure by 4 bytes and has a mandatory read/write
> on every lock/unlock operation. This is what makes it a bit different
> than the original.
>
> If the local_lock_t is replaced with spinlock_t then the data structure
> is still extended by four bytes (assuming no lockdep) and we have a
> mandatory read/write operation. The whole thing still does not work on
> PREEMPT_RT but it isn't much different from what we have now. This is
> kind of my favorite. This could be further optimized to avoid the atomic
> operation given it is always local per-CPU memory. Maybe a
> local_spinlock_t :)

I think the concern of people with pitchforks is overblown.
These are the only users of local_lock_t:

drivers/block/zram/zcomp.h:	local_lock_t lock;
drivers/char/random.c:	local_lock_t lock;
drivers/connector/cn_proc.c:	local_lock_t lock;
drivers/md/raid5.h:	local_lock_t lock;
kernel/softirq.c:	local_lock_t lock;
mm/memcontrol.c:	local_lock_t stock_lock;
mm/mlock.c:	local_lock_t lock;
mm/slub.c:	local_lock_t lock;
mm/swap.c:	local_lock_t lock;
mm/swap.c:	local_lock_t lock_irq;
mm/zsmalloc.c:	local_lock_t lock;
net/bridge/br_netfilter_hooks.c:	local_lock_t bh_lock;
net/core/skbuff.c:	local_lock_t bh_lock;
net/ipv4/tcp_sigpool.c:	local_lock_t bh_lock;

Networking is the one that really cares about performance, and there
the 'int active' adds 4 bytes but no run-time overhead, since
networking uses local_lock_nested_bh(), which doesn't touch 'active'.

memcontrol.c and slub.c are the ones that need the new trylock logic
with the 'active' field to protect from NMI on !RT. They will change
to the new local_trylock_t anyway.

What's left is not perf critical. A single WRITE_ONCE() in lock and
another WRITE_ONCE() in unlock is really in the noise.

Hence I went with the simple approach you see in this patch.
I can go with a new local_trylock_t and the _Generic() trick.
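Either way, the caller side keeps the opportunistic pattern Vlastimil
describes above. Roughly (a sketch, not the actual patch:
consume_stock_fast() is a made-up name, and it assumes a
local_trylock_irqsave() wrapper around __local_trylock_irqsave() below
plus an unlock that clears 'active'):

	static bool consume_stock_fast(void)
	{
		unsigned long flags;

		/*
		 * May run in NMI context: proceed only if we did not
		 * interrupt a holder of this per-CPU lock.
		 */
		if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags))
			return false;	/* caller falls back to the slow path */

		/* ... fast path touching the per-CPU stock ... */

		local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
		return true;
	}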
The patch will look like this:

diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..ed4623e0c71a 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -15,6 +15,14 @@ typedef struct {
 #endif
 } local_lock_t;
 
+typedef struct {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+	struct task_struct	*owner;
+#endif
+	int active;
+} local_trylock_t;
+
 #define __local_lock_irqsave(lock, flags)			\
 	do {							\
+		local_lock_t *l;				\
 		local_irq_save(flags);				\
-		local_lock_acquire(this_cpu_ptr(lock));		\
+		l = (local_lock_t *)this_cpu_ptr(lock);		\
+		_Generic((lock),				\
+			local_trylock_t *:			\
+			({					\
+				lockdep_assert(((local_trylock_t *)l)->active == 0); \
+				WRITE_ONCE(((local_trylock_t *)l)->active, 1); \
+			}),					\
+			default: (void)0);			\
+		local_lock_acquire(l);				\
 	} while (0)
+
+#define __local_trylock_irqsave(lock, flags)			\
+	({							\
+		local_trylock_t *l;				\
+		local_irq_save(flags);				\
+		l = this_cpu_ptr(lock);				\
+		if (READ_ONCE(l->active) == 1) {		\
+			local_irq_restore(flags);		\
+			l = NULL;				\
+		} else {					\
+			WRITE_ONCE(l->active, 1);		\
+			local_trylock_acquire((local_lock_t *)l); \
+		}						\
+		!!l;						\
+	})

and use the new local_trylock_t where necessary. Like:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..8fe141e93a0b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1722,7 +1722,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 }
 
 struct memcg_stock_pcp {
-	local_lock_t stock_lock;
+	local_trylock_t stock_lock;

All lines where stock_lock is used will stay as-is.
So no code churn.

The above _Generic() isn't pretty, but not horrible either.
I guess that's a decent trade-off. Better ideas?
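For reference, the _Generic() above is plain C11 compile-time type
dispatch: the branch is picked from the static type of the macro
argument, so local_lock_t users compile to exactly the code they have
today and only local_trylock_t users get the 'active' accesses.
A minimal userspace sketch of the mechanism (illustrative only,
compiles standalone):

#include <stdio.h>

typedef struct { int dummy; } local_lock_t;
typedef struct { int active; } local_trylock_t;

/* The branch is chosen at compile time from the pointer type. */
#define lock_kind(lock)						\
	_Generic((lock),					\
		local_trylock_t *: "trylock flavor (touches ->active)", \
		local_lock_t *: "plain flavor (no ->active)")

int main(void)
{
	local_lock_t l;
	local_trylock_t tl;

	puts(lock_kind(&l));	/* plain flavor (no ->active) */
	puts(lock_kind(&tl));	/* trylock flavor (touches ->active) */
	return 0;
}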