Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Fri, 24 Mar 2023 15:50:10 -0700

On Fri, Mar 24, 2023 at 7:12 AM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 3/24/23 03:22, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
> >> Hello,
> >>
> >> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> >>> Currently, when sleeping is not allowed during rstat flushing, we hold
> >>> the global rstat lock with interrupts disabled throughout the entire
> >>> flush operation. Flushing is an O(# cgroups * # cpus) operation, and
> >>> having interrupts disabled throughout is dangerous.
> >>>
> >>> For some contexts, we may not want to sleep, but can be interrupted
> >>> (e.g. while holding a spinlock or RCU read lock). As such, do not
> >>> disable interrupts throughout rstat flushing, only when holding the
> >>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> >>> interrupts disabled to a series of O(# cgroups) durations.
> >>>
> >>> Furthermore, if a cpu is spinning waiting for the global rstat lock, it
> >>> doesn't need to spin with interrupts disabled anymore.
> >> I'm generally not a fan of big spin locks w/o irq protection. They too often
> >> become a source of unpredictable latency spikes. As you said, the global
> >> rstat lock can be held for quite a while. Removing _irq makes irq latency
> >> better on the CPU but on the other hand it makes it a lot more likely that the
> >> lock is gonna be held even longer, possibly significantly so depending on
> >> the configuration and workload which will in turn stall other CPUs waiting
> >> for the lock. Sure, irqs are being serviced quicker but if the cost is more
> >> and longer !irq context multi-cpu stalls, what's the point?
> >>
> >> I don't think there's anything which requires the global lock to be held
> >> throughout the entire flushing sequence and irq needs to be disabled when
> >> grabbing the percpu lock anyway, so why not just release the global lock on
> >> CPU boundaries instead? We don't really lose anything significant that way.
> >> The durations of irq disabled sections are still about the same as in the
> >> currently proposed solution at O(# cgroups) and we avoid the risk of holding
> >> the global lock for too long unexpectedly from getting hit repeatedly by
> >> irqs while holding the global lock.
> > Thanks for taking a look!
> >
> > I think a problem with this approach is that we risk having to contend
> > for the global lock at every CPU boundary in atomic contexts. Right
> Isn't it the plan to just do a trylock in atomic contexts so that it
> won't get stuck spinning for the lock for an indeterminate amount of time?

Not exactly. On the memory controller side, we currently only allow
one flusher at a time and force all flushers to flush the full
hierarchy, such that concurrent flushers can skip. This is done for
both atomic and non-atomic contexts.
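
As a rough illustration of the "one flusher at a time, everyone else
skips" scheme, here is a user-space sketch using C11 atomics. It is not
the memory controller code: flush_ongoing, flush_begin() and flush_end()
are hypothetical names chosen for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: nonzero while some flusher is doing a full flush. */
static atomic_int flush_ongoing;

/*
 * Try to become the single flusher. Returns true if the caller won and
 * must flush the full hierarchy, false if a flush is already in flight
 * and the caller can skip.
 */
static bool flush_begin(void)
{
	return atomic_exchange(&flush_ongoing, 1) == 0;
}

/* Called by the winning flusher once the full flush is done. */
static void flush_end(void)
{
	atomic_store(&flush_ongoing, 0);
}
```

Skipping is only reasonable because the winner flushes the full
hierarchy, so whatever subtree the skipping caller cared about is
covered by the flush already in flight.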

Flushers outside the memory controller can still contend for the lock
among themselves or with flushers in the memory controller. In this
case, instead of contending for the lock once per flush, they contend
for it at each CPU boundary.
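
To make the difference concrete, the two locking patterns can be
sketched in user space as below. This is only an illustration under
stated assumptions: the toy spinlock built on atomic_flag stands in for
the global rstat lock, flush_one_cpu() is a placeholder for the
O(# cgroups) per-CPU work, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy spinlock standing in for the global rstat lock (assumption). */
static atomic_flag global_lock = ATOMIC_FLAG_INIT;

static void global_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&global_lock,
						 memory_order_acquire))
		; /* spin until the current holder releases */
}

static void global_lock_release(void)
{
	atomic_flag_clear_explicit(&global_lock, memory_order_release);
}

/* Placeholder for the per-CPU flush work. */
static void flush_one_cpu(int cpu, long *flushed)
{
	(void)cpu;
	(*flushed)++;
}

/* Hold the global lock across all CPUs; returns lock acquisitions. */
static int flush_hold_lock(int ncpus, long *flushed)
{
	int acquisitions = 0;

	assert(ncpus > 0);
	global_lock_acquire();
	acquisitions++;
	for (int cpu = 0; cpu < ncpus; cpu++)
		flush_one_cpu(cpu, flushed);
	global_lock_release();
	return acquisitions;
}

/* Drop and retake the global lock at every CPU boundary. */
static int flush_release_per_cpu(int ncpus, long *flushed)
{
	int acquisitions = 0;

	assert(ncpus > 0);
	for (int cpu = 0; cpu < ncpus; cpu++) {
		global_lock_acquire();
		acquisitions++;
		flush_one_cpu(cpu, flushed);
		global_lock_release();
	}
	return acquisitions;
}
```

With ncpus CPUs, the first variant takes the lock once per flush and
the second takes it ncpus times, and each of those acquisitions is a
point where a concurrent flusher can get in and make us wait.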

> > now we contend for the global lock once, and once we have it we go
> > through all CPUs to flush, only having to contend with updates taking
> > the percpu locks at this point. If we unconditionally release &
> > reacquire the global lock at every CPU boundary then we may contend
> > for it much more frequently with concurrent flushers.
>
> Note that with the use of qspinlock in all the major arches, the impact
> of thundering herds of lockers is much less serious than before. There
> is certainly some overhead in doing multiple lock acquires and
> releases, but that shouldn't be too excessive.

I ran some tests to measure this. Since I am using a cgroup v1
hierarchy, I cannot reproduce contention between memory controller
flushers and non-memory controller flushers, so I removed the "one
memory flusher only" restriction to have concurrent memory flushers
compete for the global rstat lock to measure the impact:

Before (only one flusher allowed to compete for the global rstat lock):
            ---cgroup_rstat_flush
               |
                --1.27%--cgroup_rstat_flush_locked
                          |
                           --0.94%--mem_cgroup_css_rstat_flush

After (concurrent flushers allowed to compete for the global rstat lock):
            ---cgroup_rstat_flush
               |
               |--4.94%--_raw_spin_lock
               |          |
               |           --4.94%--queued_spin_lock_slowpath
               |
                --0.92%--cgroup_rstat_flush_locked
                          |
                           --0.56%--mem_cgroup_css_rstat_flush

This was run with 20 processes trying to flush concurrently, so it may
be excessive, but in this case lock contention makes a significant
difference: ~4.94% now shows up in queued_spin_lock_slowpath.

Again, this is not a regression for non-atomic flushers, as they
already compete for the lock at every CPU boundary, but for atomic
flushers that don't give up the lock at all today, it would be a
regression to start competing for the lock at every CPU boundary. This
patch series aims to minimize the number of atomic flushers (brings
them down to two, one of which is not common), so this may be fine.

My main concern is that for some flushers that this series converts
from atomic to non-atomic, we may notice a regression later and revert
it (e.g. refault path), which is why I have them in separate patches.
If we regress the atomic flushing path, it would be a larger surgery
to restore the performance for these paths -- which is why I would
rather keep the atomic path without excessive lock contention.

Thoughts?

>
> I am all in for reducing lock hold time as much as possible as it will
> improve the response time.
>
> Cheers,
> Longman
>