On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
>
> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
> with a spinlock").
>
> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
> necessary during the collection of per-CPU stats, this approach has led
> to several scaling issues observed in production environments. Holding
> this IRQ lock has caused starvation of other critical kernel functions,
> such as softirq (e.g., timers and netstack). Although kernel v6.8
> introduced optimizations in this area, we continue to observe instances
> where the spin_lock is held for 64-128 ms in production.
>
> This patch converts cgroup_rstat_lock back to being a mutex lock. This
> change is made possible thanks to the significant effort by Yosry Ahmed
> to eliminate all atomic context use-cases through multiple commits,
> ending in 0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"),
> included in kernel v6.5.
>
> After this patch, lock contention will be less obvious, as converting this
> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
> it doesn't remove the lock contention. It is recommended to use the
> tracepoints to diagnose this.

I will keep the high-level conversation about using the mutex here in the
cover letter thread, but I am wondering why we are keeping the lock-dropping
logic here with the mutex? If this is to reduce lock contention, why does it
depend on need_resched()? spin_needbreak() is a good indicator for lock
contention, but need_resched() isn't, right?

Also, how was this tested? When I did previous changes to the flushing logic
I used to make sure that userspace read latency was not impacted, as well as
in-kernel flushers (e.g. reclaim). We should make sure there are no
regressions on both fronts.
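For context, the userspace side of that testing can be as simple as timing
repeated reads of a cgroup v2 stat file whose read path goes through an
rstat flush. A rough, throwaway sketch of what I mean (the memory.stat path
below is only an example, adjust it for the machine under test):

/*
 * Time repeated reads of a cgroup v2 stat file; reading memory.stat
 * triggers an rstat flush, so the read latency reflects time spent
 * waiting on / holding cgroup_rstat_lock.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define STAT_FILE "/sys/fs/cgroup/system.slice/memory.stat" /* example path */
#define ITERATIONS 1000

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	char buf[65536];
	long long worst = 0, total = 0;
	int i;

	for (i = 0; i < ITERATIONS; i++) {
		long long t0, dt;
		int fd = open(STAT_FILE, O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		t0 = now_ns();
		while (read(fd, buf, sizeof(buf)) > 0)
			;	/* drain the file; this is the flush path */
		dt = now_ns() - t0;
		close(fd);

		total += dt;
		if (dt > worst)
			worst = dt;
	}
	printf("reads=%d avg=%lld us worst=%lld us\n",
	       ITERATIONS, total / ITERATIONS / 1000, worst / 1000);
	return 0;
}

Comparing avg/worst latency before and after the series, while some load
keeps the per-CPU updated trees busy, should show whether the mutex
conversion regresses readers.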
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
> ---
>  kernel/cgroup/rstat.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index ff68c904e647..a90d68a7c27f 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,7 +9,7 @@
>
>  #include <trace/events/cgroup.h>
>
> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
> +static DEFINE_MUTEX(cgroup_rstat_lock);
>  static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
> @@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
>  {
>         bool contended;
>
> -       contended = !spin_trylock_irq(&cgroup_rstat_lock);
> +       contended = !mutex_trylock(&cgroup_rstat_lock);
>         if (contended) {
>                 trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
> -               spin_lock_irq(&cgroup_rstat_lock);
> +               mutex_lock(&cgroup_rstat_lock);
>         }
>         trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>  }
> @@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>         __releases(&cgroup_rstat_lock)
>  {
>         trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
> -       spin_unlock_irq(&cgroup_rstat_lock);
> +       mutex_unlock(&cgroup_rstat_lock);
>  }
>
>  /* see cgroup_rstat_flush() */
> @@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>         }
>
>         /* play nice and yield if necessary */
> -       if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> +       if (need_resched()) {
>                 __cgroup_rstat_unlock(cgrp, cpu);
>                 if (!cond_resched())
>                         cpu_relax();
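Since the cover letter recommends the tracepoints for diagnosing contention
once this is a mutex, here is a minimal sketch of watching them from
userspace. It assumes tracefs is mounted at /sys/kernel/tracing and that the
events appear under events/cgroup/ with the same names as the
trace_cgroup_rstat_*() calls in the helpers above (run as root); a bpftrace
or perf-based equivalent would work just as well:

/*
 * Enable the rstat lock contention tracepoint and stream the raw
 * event text from trace_pipe. Paths and event names are assumptions
 * based on the helpers quoted above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd;

	/* Enable only the contention event; it fires when the trylock fails. */
	write_str("/sys/kernel/tracing/events/cgroup/cgroup_rstat_lock_contended/enable",
		  "1");

	fd = open("/sys/kernel/tracing/trace_pipe", O_RDONLY);
	if (fd < 0) {
		perror("trace_pipe");
		return 1;
	}
	while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
		buf[n] = '\0';
		fputs(buf, stdout);
	}
	return 0;
}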