Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Tue, 29 Aug 2023 12:54:06 -0700

On Tue, Aug 29, 2023 at 12:36 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote:
> ...
> > > So, the assumptions in the original design were:
> > >
> > > * Writers are high freq but readers are lower freq and can block.
> > >
> > > * The global lock is mutex.
> > >
> > > * Back-to-back reads won't have too much to do because it only has to flush
> > >   what's been accumulated since the last flush which took place just before.
> > >
> > > It's likely that the userspace side is gonna be just fine if we restore the
> > > global lock to be a mutex and let them be. Most of the problems are caused
> > > by trying to allow flushing from non-sleepable and kernel contexts.
> >
> > So basically restore the flush without disabling preemption, and if a
> > userspace reader gets preempted while holding the mutex it's probably
> > okay because we won't have high concurrency among userspace readers?
> >
> > I think Shakeel was worried that this may cause a priority inversion
> > where a low priority task is preempted while holding the mutex, and
> > prevents high priority tasks from acquiring it to read the stats and
> > take actions (e.g. userspace OOMs).
>
> We'll have to see but I'm not sure this is going to be a huge problem. The
> most common way that priority inversions are resolved is through work
> conservation - ie. the system just runs out of other things to do, so
> whatever is blocking the system gets to run and unblocks it. It only really
> becomes a problem when work conservation breaks down on CPU side which
> happens if the one holding the resource is 1. blocked on IOs (usually
> through memory allocation but can be anything) 2. throttled by cpu.max.
>
> #1 is not a factor here. #2 is but that is a factor for everything in the
> kernel and should really be solved from cpu.max side. So, I think in
> practice, this should be fine, or at least not worse than anything else.

If that's not a concern I can just append a patch that changes the
spinlock to a mutex to this series. Shakeel, wdyt?

>
> > > Would it
> > > make sense to distinguish what can and can't wait and make the latter group
> > > always use cached value? e.g. even in kernel, during oom kill, waiting
> > > doesn't really matter and it can just wait to obtain the up-to-date numbers.
> >
> > The problem with waiting for in-kernel flushers is that high
> > concurrency leads to terrible serialization. Running a stress test
> > with 100s of reclaimers where everyone waits shows ~ 3x slowdowns.
> >
> > This patch series is indeed trying to formalize a distinction between
> > waiters who can wait and those who can't on the memcg side:
> >
> > - Unified flushers always flush the entire tree and only flush if no
> > one else is flushing (no waiting), otherwise they use cached data and
> > hope the concurrent flushing helps. This is what we currently do for
> > most memcg contexts. This patch series opts userspace reads out of it.
> >
> > - Non-unified flushers only flush the subtree they care about and they
> > wait if there are other flushers. This is what we currently do for
> > some zswap accounting code. This patch series opts userspace readers
> > into this.
> >
> > The problem Michal is raising is that dropping the lock can lead to an
> > unbounded number of waiters and longer worst case latency. Especially
> > that this is directly influenced by userspace. Reintroducing the mutex
> > and removing the lock dropping code fixes that problem, but then if
> > the mutex holder gets preempted, we face another problem.
> >
> > Personally I think there is a good chance there won't be userspace
> > latency problems due to the lock as usually there isn't high
> > concurrency among userspace readers, and we can deal with that problem
> > if/when it happens. So far no problem is happening for cpu.stat which
> > has the same potential problem.
>
> Maybe leave the global lock as-is and gate the userland flushers with a
> mutex so that there's only ever one contenting on the rstat lock from
> userland side?

Waiman suggested this as well. We can do that for sure, although I
think we should wait until we are sure it's needed.

One question. If whoever is holding that mutex is either flushing with
the spinlock held or spinning (i.e not sleepable or preemptable),
wouldn't this be equivalent to just changing the spinlock with a mutex
and disable preemption while holding it?