Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock

Tejun Heo <tj@xxxxxxxxxx> · Fri, 24 Mar 2023 15:54:50 -1000

Hello,

On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote:
> I think a problem with this approach is that we risk having to contend
> for the global lock at every CPU boundary in atomic contexts. Right
> now we contend for the global lock once, and once we have it we go
> through all CPUs to flush, only having to contend with updates taking
> the percpu locks at this point. If we unconditionally release &
> reacquire the global lock at every CPU boundary then we may contend
> for it much more frequently with concurrent flushers.
> 
> On the memory controller side, concurrent flushers are already held
> back to avoid a thundering herd problem on the global rstat lock, but
> flushers from outside the memory controller can still compete together
> or with a flusher from the memory controller. In this case, we risk
> contending the global lock more and concurrent flushers taking a
> longer period of time, which may end up causing multi-CPU stalls
> anyway, right? Also, if we keep _irq when spinning for the lock, then
> concurrent flushers still need to spin with irq disabled -- another
> problem that this series tries to fix.
> 
> This is particularly a problem for flushers in atomic contexts. There
> is a flusher in mem_cgroup_wb_stats() that flushes while holding
> another spinlock, and a flusher in mem_cgroup_usage() that flushes
> with irqs disabled. If flushing takes a longer period of time due to
> repeated lock contention, it affects such atomic context negatively.
> 
> I am not sure how all of this matters in practice, it depends heavily
> on the workloads and the configuration like you mentioned. I am just
> pointing out the potential disadvantages of reacquiring the lock at
> every CPU boundary in atomic contexts.

So, I'm not too convinced by the arguments for a couple reasons:

* It's not very difficult to create conditions where a contented non-irq
  protected spinlock is held unnecessarily long due to heavy IRQ irq load on
  the holding CPU. This can easily extend the amount of time the lock is
  held by multiple times if not orders of magnitude. That is likely a
  significantly worse problem than the contention on the lock cacheline
  which will lead to a lot more gradual degradation.

* If concurrent flushing is an actual problem, we need and can implement a
  better solution. There's quite a bit of maneuvering room here given that
  the flushing operations are mostly idempotent in close time proximity and
  there's no real atomicity requirement across different segments of
  flushing operations.

Thanks.

-- 
tejun