On Mon, Dec 4, 2023 at 3:31 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Mon, Dec 4, 2023 at 1:38 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Mon, Dec 4, 2023 at 12:12 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > On Sat, Dec 2, 2023 at 12:31 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Nov 29, 2023 at 03:21:53AM +0000, Yosry Ahmed wrote:
> > > > [...]
> > > > > +void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> > > > >  {
> > > > > -	if (memcg_should_flush_stats(root_mem_cgroup))
> > > > > -		do_flush_stats();
> > > > > +	static DEFINE_MUTEX(memcg_stats_flush_mutex);
> > > > > +
> > > > > +	if (mem_cgroup_disabled())
> > > > > +		return;
> > > > > +
> > > > > +	if (!memcg)
> > > > > +		memcg = root_mem_cgroup;
> > > > > +
> > > > > +	if (memcg_should_flush_stats(memcg)) {
> > > > > +		mutex_lock(&memcg_stats_flush_mutex);
> > > >
> > > > What's the point of this mutex now? What is it providing? I understand
> > > > we cannot try_lock here due to targeted flushing. Why not just let the
> > > > global rstat lock serialize the flushes? Actually, this mutex can cause
> > > > latency hiccups, as the mutex owner can get rescheduled during a flush
> > > > and then no one can flush for a potentially long time.
> > >
> > > I was hoping this was clear from the commit message and code comments,
> > > but apparently I was wrong, sorry. Let me give more context.
> > >
> > > In previous versions and/or series, the mutex was only used for
> > > flushes from userspace, to guard in-kernel flushers against high
> > > contention from userspace. Later on, I kept the mutex for all memcg
> > > flushers for the following reasons:
> > >
> > > (a) Allow waiters to sleep:
> > > Unlike other flushers, the memcg flushing path can see a lot of
> > > concurrency. The mutex avoids having a lot of CPUs spinning (e.g.
> > > concurrent reclaimers) by allowing waiters to sleep.
> > >
> > > (b) Check the threshold under lock but before calling cgroup_rstat_flush():
> > > The calls to cgroup_rstat_flush() are not very cheap even if there's
> > > nothing to flush, as we still need to iterate all CPUs. If flushers
> > > contend directly on the rstat lock, overlapping flushes will
> > > unnecessarily do the percpu iteration once they hold the lock. With
> > > the mutex, they will check the threshold again once they hold the
> > > mutex.
> > >
> > > (c) Protect non-memcg flushers from contention from memcg flushers:
> > > This is not as strong an argument as protecting in-kernel flushers
> > > from userspace flushers.
> > >
> > > There have been discussions before about changing the rstat lock itself
> > > to a mutex, which would resolve (a), but there are concerns about
> > > priority inversions if a low-priority task holds the mutex and gets
> > > preempted, as well as about the amount of time the rstat lock holder
> > > keeps the lock for:
> > > https://lore.kernel.org/lkml/ZO48h7c9qwQxEPPA@xxxxxxxxxxxxxxx/
> > >
> > > I agree about possible hiccups due to the inner lock being dropped
> > > while the mutex is held. Running a synthetic test with high
> > > concurrency between reclaimers (in-kernel flushers) and stats readers
> > > shows no material performance difference with or without the mutex.
> > > Maybe things cancel out, or don't really matter in practice.
> > >
> > > I would prefer to keep the current code, as I think (a) and (b) could
> > > cause problems in the future, and the current form of the code (with
> > > the mutex) has already seen mileage with production workloads.
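For reference, point (b) above describes a classic check/lock/re-check
pattern. The quoted diff is cut off right after mutex_lock(), so the
following is only a sketch of how the body plausibly continues, based on
the description in (b); do_flush_stats() taking the target memcg is an
assumption, since the quote elides that part:

	if (memcg_should_flush_stats(memcg)) {
		mutex_lock(&memcg_stats_flush_mutex);
		/*
		 * Re-check under the mutex: a concurrent flusher may have
		 * already flushed while we waited, in which case we skip
		 * the full per-CPU iteration in cgroup_rstat_flush().
		 */
		if (memcg_should_flush_stats(memcg))
			do_flush_stats(memcg);	/* assumed signature */
		mutex_unlock(&memcg_stats_flush_mutex);
	}

Only the first flusher through finds the threshold still exceeded;
overlapping flushers bail out without ever touching the rstat lock.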
> >
> > Correction: The priority inversion is possible on the memcg side due
> > to the mutex in this patch. Also, for point (a), the spinners will
> > eventually sleep once they hold the lock and hit the first CPU
> > boundary -- because of the lock dropping and cond_resched(). So
> > eventually, all spinners should be able to sleep, although it will be
> > a while until they do. With the mutex, they all sleep from the
> > beginning. Point (b) still holds, though.
> >
> > I am slightly inclined to keep the mutex, but I can send a small
> > fixlet to remove it if others think otherwise.
> >
> > Shakeel, Wei, any preferences?
>
> My preference is to avoid the issue we know we see in production a lot,
> i.e. priority inversion.
>
> In the future, if you see issues with spinning, you can come up with a
> lockless flush mechanism at that time.

Given that the synthetic high-concurrency test doesn't show a material
performance difference between the mutex and non-mutex versions, I agree
that the mutex can be taken out of this patch set (one less global mutex
to worry about).
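For concreteness, with the mutex taken out, the flush path would be left
looking roughly like this (a sketch derived from the diff quoted above,
not the actual fixlet; as before, do_flush_stats() taking a memcg
argument is an assumption):

void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
	if (mem_cgroup_disabled())
		return;

	if (!memcg)
		memcg = root_mem_cgroup;

	/*
	 * No mutex: contending flushers serialize on the global rstat
	 * lock inside the flush itself, which gets dropped at CPU
	 * boundaries with a cond_resched(), so spinners do eventually
	 * get to sleep (point (a) above).
	 */
	if (memcg_should_flush_stats(memcg))
		do_flush_stats(memcg);	/* assumed signature */
}

The trade-off is giving up the re-check in (b): overlapping flushers may
redo the per-CPU walk, in exchange for removing a global mutex and its
priority-inversion risk.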