On Thu, Apr 18, 2024 at 2:02 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
>
> On 18/04/2024 04.19, Yosry Ahmed wrote:
> > On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
> >>
> >> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
> >> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
> >> with a spinlock").
> >>
> >> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
> >> necessary during the collection of per-CPU stats, this approach has led
> >> to several scaling issues observed in production environments. Holding
> >> this IRQ lock has caused starvation of other critical kernel functions,
> >> such as softirq (e.g., timers and netstack). Although kernel v6.8
> >> introduced optimizations in this area, we continue to observe instances
> >> where the spin_lock is held for 64-128 ms in production.
> >>
> >> This patch converts cgroup_rstat_lock back to being a mutex lock. This
> >> change is made possible thanks to the significant effort by Yosry Ahmed
> >> to eliminate all atomic context use-cases through multiple commits,
> >> ending in 0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"),
> >> included in kernel v6.5.
> >>
> >> After this patch, lock contention will be less obvious: converting this
> >> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
> >> it doesn't remove the lock contention. It is recommended to use the
> >> tracepoints to diagnose this.
> >
> > I will keep the high-level conversation about using the mutex here in
> > the cover letter thread, but I am wondering why we are keeping the
> > lock-dropping logic here with the mutex?
> >
>
> I agree that yielding the mutex in the loop makes less sense,
> especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> will be a preemption point for my softirq. But I kept it because we
> are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I was still worried
> that there would be no sched point for other userspace processes while
> holding the mutex, and I don't fully know the sched implications of
> holding a mutex.

I guess dropping the lock before rescheduling could be preferable in
this case, since we do not need to keep holding the lock for
correctness (a rough sketch of what I mean is at the end of this mail).

> > If this is to reduce lock contention, why does it depend on
> > need_resched()? spin_needbreak() is a good indicator for lock
> > contention, but need_resched() isn't, right?
> >
>
> As I said, I'm unsure of the semantics of holding a mutex.
>
> >
> > Also, how was this tested?
> >
>
> I tested this in a testlab, prior to posting upstream, with parallel
> readers of the stat files.

I believe high concurrency is a key point here. CC'ing Wei, who
reported regressions on my previous attempts to address the lock
contention from userspace.

> As I said in the other mail, I plan to experiment with these patches
> (2+3) in production, as micro-benchmarking will not reveal the corner
> cases we care about.

Right, but micro-benchmarking should give us a signal about
regressions. It was very useful for me, when working on this code
before, to use synthetic benchmarks with high concurrency of userspace
reads and/or kernel flushers.
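
For reference, the kind of synthetic benchmark I mean can be as simple
as many concurrent readers hammering a cgroup stat file. A minimal
sketch (the stat path, cgroup v2 mount point, and reader count below
are placeholders, not the exact setup that was actually run):

/*
 * Minimal reader-concurrency stress sketch (not the benchmark actually
 * used): NREADERS threads each re-read a cgroup v2 stat file in a loop,
 * so many concurrent readers trigger rstat flushes at the same time.
 * Assumes cgroup v2 is mounted at /sys/fs/cgroup; build with -pthread.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define STAT_PATH "/sys/fs/cgroup/memory.stat"	/* placeholder path */
#define NREADERS  32				/* placeholder concurrency */

static void *reader(void *arg)
{
	char buf[8192];

	(void)arg;
	for (;;) {
		int fd = open(STAT_PATH, O_RDONLY);

		if (fd < 0) {
			perror("open");
			exit(1);
		}
		/* each full read of the file flushes rstat in the kernel */
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NREADERS];

	for (int i = 0; i < NREADERS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);
	for (int i = 0; i < NREADERS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Running this against the cgroup of a busy workload, while watching
flush latency, is usually enough to surface the contention we are
talking about.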
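And to be concrete about the lock-dropping point above, this is roughly
the shape I have in mind. It is only a sketch, not the code in this
patch: flush_sketch() is a made-up name, the per-CPU collection is
elided, and it assumes cgroup_rstat_lock is the mutex this patch
introduces:

/*
 * Sketch only: yield pattern for the flush loop once cgroup_rstat_lock
 * is a mutex. The idea is to release the (now sleepable) global lock
 * before rescheduling, rather than calling into the scheduler while
 * still holding it.
 */
static void flush_sketch(void)
{
	int cpu;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
		unsigned long flags;

		raw_spin_lock_irqsave(cpu_lock, flags);
		/* ... collect this CPU's pending per-cgroup deltas ... */
		raw_spin_unlock_irqrestore(cpu_lock, flags);

		/* drop the global lock before yielding the CPU */
		if (need_resched()) {
			mutex_unlock(&cgroup_rstat_lock);
			cond_resched();
			mutex_lock(&cgroup_rstat_lock);
		}
	}
}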