[..]
> > > Perhaps we could experiment with always dropping the lock at CPU
> > > boundaries instead?
> > >
> >
> > I don't think this will be enough (always dropping the lock at CPU
> > boundaries). My measured "lock-hold" times show that the lock is
> > blocking IRQ (and softirq) for too long, when looking at prod with my
> > new cgroup tracepoint script[2]. When contention occurs, I see many
> > Yields happening, with the same magnitude as Contended, but I still
> > see events with long "lock-hold" times, even though yields are high.
> >
> > [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
> >
> > Example output:
> >
> >  12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
> >  12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
> >  12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100 comm:kworker/u261:12
> >
> >  12:46:56 time elapsed: 36 sec (interval = 1 sec)
> >   Flushes(2051) 15/interval (avg 56/sec)
> >   Locks(44464) 1340/interval (avg 1235/sec)
> >   Yields(42413) 1325/interval (avg 1178/sec)
> >   Contended(42112) 1322/interval (avg 1169/sec)
> >
> > There are 15 reported flushes/sec, but the locks are yielded quickly.
> >
> > More problematically (for softirq latency), we see a Long lock-hold
> > time reaching 18 ms. For network RX softirq I need latency lower than
> > 0.5 ms to avoid RX-ring HW queue overflows.

Here we are measuring yields against contention, but the main problem here
is IRQ serving latency, which doesn't have to correlate with contention,
right?

Perhaps contention is causing us to yield the lock every nth CPU boundary,
but apparently this is not enough for IRQ serving latency. Dropping the
lock on each boundary should improve IRQ serving latency, regardless of
the presence of contention. Let's focus on one problem at a time ;)

> >
> > --Jesper
> >
> > p.s. I'm seeing a pattern with kswapdN contending on this lock.
> >
> > @stack[697, kswapd3]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
> > @stack[698, kswapd4]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
> > @stack[699, kswapd5]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
>
> Can you simply replace mem_cgroup_flush_stats() in
> prepare_scan_control() with the ratelimited version and see if the issue
> still persists for your production traffic?

With thresholding, the fact that we reach cgroup_rstat_flush() means that
there is a high magnitude of pending updates. I think Jesper mentioned 128
CPUs before, which means 128 * 64 (MEMCG_CHARGE_BATCH) page-sized updates.
That could be over 33 MBs with a 4K page size.

I am not sure if it's fine to ignore such updates in shrink_node(),
especially since it is sometimes called in a loop, so I imagine we may
want to see what changed after the last iteration.

> Also were you able to get which specific stats are getting the most
> updates?

This, on the other hand, would be very interesting.
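For reference, a back-of-the-envelope sketch of the 33 MB estimate above.
The inputs are assumptions taken from this thread rather than anything
measured here (128 online CPUs, MEMCG_CHARGE_BATCH of 64, 4 KiB pages):

#include <stdio.h>

int main(void)
{
        const long nr_cpus = 128;       /* CPU count Jesper mentioned */
        const long charge_batch = 64;   /* MEMCG_CHARGE_BATCH */
        const long page_size = 4096;    /* 4 KiB pages */

        /* worst case: every CPU sits just below its batch threshold */
        long pages = nr_cpus * charge_batch;   /* 8192 page-sized updates */
        long bytes = pages * page_size;        /* 33554432 bytes */

        printf("worst-case pending updates: %ld bytes (~%.1f MB)\n",
               bytes, bytes / 1e6);
        return 0;
}

Run as-is this prints roughly 33.6 MB (32 MiB), which is where the "over
33 MBs" figure comes from.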
I think it is very possible that we don't actually have 33 MBs of updates, but rather we keep adding and subtracting from the same stat until we reach the threshold. This could especially be true for hot stats like slab allocations.
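To illustrate why, here is a minimal userspace sketch. It is not the
actual memcg update path; the names and the CHARGE_BATCH constant are
stand-ins. It only shows how an update counter that accumulates the
*magnitude* of each change lets a stat that bounces around zero still
reach the flush threshold:

#include <stdio.h>
#include <stdlib.h>

#define CHARGE_BATCH 64   /* stand-in for MEMCG_CHARGE_BATCH */

static long stat_value;       /* the stat itself (net change) */
static long pending_updates;  /* magnitude of updates since last flush */

static void stat_update(long delta)
{
        stat_value += delta;
        pending_updates += labs(delta);   /* magnitude, not net delta */

        if (pending_updates > CHARGE_BATCH) {
                printf("flush threshold hit: net change = %ld, magnitude = %ld\n",
                       stat_value, pending_updates);
                pending_updates = 0;
        }
}

int main(void)
{
        /* Alternate +1/-1 on the same stat: the net change stays ~0,
         * but the magnitude counter crosses the threshold after ~65
         * updates and keeps doing so. */
        for (int i = 0; i < 200; i++)
                stat_update(i & 1 ? -1 : 1);
        return 0;
}

If the real counter behaves like this, a hot stat such as slab alloc/free
could keep tripping the threshold even though the delta actually flushed
is tiny.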