On Wed, Oct 5, 2022 at 11:38 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Wed, Oct 5, 2022 at 11:22 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
> >
> > Hello,
> >
> > On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > > > I was thinking more of that being done inside the flush function.
> > >
> > > I think the flush function already does that in some sense if
> > > might_sleep is true, right? The problem here is that we are using
> >
> > Oh I forgot about that. Right.
> >
> > ...
> > > I took kdumps from a couple of crashed machines and ran a script to
> > > traverse the updated memcgs and check how many cpus have updates and
> > > how many updates there are on each cpu. I found that on average only
> > > a couple of stats are updated per-cpu per-cgroup, and on less than
> > > 25% of cpus (but this is on a large machine; I expect the number to
> > > go higher on smaller machines), which is why I suggested a bitmask.
> > > I understand though that this depends on whatever workloads were
> > > running on those machines, and that in cases where most stats are
> > > updated the bitmask will actually make things slightly worse.
> >
> > One worry I have about selective flushing is that it's only gonna
> > improve things by some multiples, while we can reasonably increase
> > the problem size by orders of magnitude.
>
> I think we would usually want to flush a few stats (< 5?) in irqsafe
> contexts out of over 100, so I would say the improvement would be
> good, but yeah, the problem size can reasonably increase more than
> that. It also depends on which stats we selectively flush. If they
> are not in the same cache line we might end up bringing a lot of
> stats into the cpu cache anyway.
>
> > The only real ways out I can think of are:
> >
> > * Implement a periodic flusher which keeps the stats needed in the
> >   irqsafe path acceptably up to date, to avoid flushing with irqs
> >   disabled. We can make this adaptive too - no reason to do all this
> >   if the number to flush isn't huge.
>
> We do have a periodic flusher today for memcg stats (see
> flush_memcg_stats_dwork). It calls __mem_cgroup_flush_stats(), which
> only flushes if the total number of updates is over a certain
> threshold.
> mem_cgroup_flush_stats_delayed(), which is called in the page fault
> path, only does a flush if the last flush was a certain while ago. We
> don't use the delayed version in all irqsafe contexts, though, and I
> am not the right person to tell whether we can.
>
> But I think this is not what you meant. I think you meant only
> flushing the specific stats needed in irqsafe contexts more
> frequently, and not invoking a flush at all in irqsafe contexts (or
> using mem_cgroup_flush_stats_delayed()..?). Right?
>
> I am not the right person to judge what is acceptably up to date, to
> be honest, so I would wait for other memcg folks to chime in on this.
>
> > * Shift some work to the updaters. E.g. in many cases, propagating
> >   per-cpu updates a couple of levels up from the update path will
> >   significantly reduce the fanout and thus the number of entries
> >   which need to be flushed later. It does add ongoing overhead, so it
> >   should probably be adaptive or configurable - hopefully the former.
>
> If we are adding overhead to the updaters, would it be better to
> maintain a bitmask of updated stats, or do you think it would be more
> effective to propagate updates a couple of levels up? I think to
> propagate updates up in the updater's context we would need percpu
> versions of the "pending" stats, which would also add memory
> consumption.
> Any thoughts here, Tejun or anyone?
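To make the bitmask option a bit more concrete, here is a rough,
untested sketch of what per-cpu dirty tracking could look like. All
identifiers below (memcg_vmstats_percpu_sketch, sketch_mod_state,
MEMCG_NR_STAT as the item count, etc.) are placeholders for
illustration, not proposed kernel symbols; only the bitmap helpers are
existing APIs:

    #include <linux/bitmap.h>
    #include <linux/bitops.h>
    #include <linux/percpu.h>

    /* hypothetical per-cpu stat state with a dirty bitmap */
    struct memcg_vmstats_percpu_sketch {
            long            state[MEMCG_NR_STAT];
            /* one bit per stat item with unflushed updates on this cpu */
            DECLARE_BITMAP(dirty, MEMCG_NR_STAT);
    };

    /* update path: remember which item changed */
    static void sketch_mod_state(struct memcg_vmstats_percpu_sketch *pcpu,
                                 int idx, int val)
    {
            pcpu->state[idx] += val;
            set_bit(idx, pcpu->dirty);
    }

    /* flush path: walk only the items that actually changed */
    static void sketch_flush_cpu(struct memcg_vmstats_percpu_sketch *pcpu,
                                 long *pending)
    {
            int idx;

            for_each_set_bit(idx, pcpu->dirty, MEMCG_NR_STAT) {
                    pending[idx] += pcpu->state[idx];
                    pcpu->state[idx] = 0;
            }
            bitmap_zero(pcpu->dirty, MEMCG_NR_STAT);
    }

Since updates are cpu-local, the non-atomic __set_bit() may suffice
under the right context, so the per-update cost could be close to
free; the trade-off is one extra bitmap per cpu per cgroup, and, as
noted above, a workload that touches most stats would get no benefit.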
> >
> > Thanks.
> >
> > --
> > tejun
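For the "propagate a couple of levels up from the updater" option, my
reading is something like the sketch below (again untested; every
identifier here is an invented placeholder). The percpu "pending"
state at each ancestor is where the extra memory consumption mentioned
above would come from:

    #include <linux/percpu.h>

    #define SKETCH_NR_STAT          128
    #define SKETCH_PROPAGATE_LEVELS 2

    struct sketch_memcg_percpu {
            /* percpu "pending" state per cgroup - the extra memory cost */
            long    state[SKETCH_NR_STAT];
    };

    struct sketch_memcg {
            struct sketch_memcg            *parent;
            struct sketch_memcg_percpu     __percpu *pcpu;
    };

    static void sketch_mod_state_propagate(struct sketch_memcg *memcg,
                                           int idx, int val)
    {
            struct sketch_memcg *pos = memcg;
            int level;

            /*
             * Fold the delta into this cgroup and a couple of ancestors
             * at update time, trading per-update cost for fewer tree
             * levels the flusher has to walk later.
             */
            for (level = 0; pos && level <= SKETCH_PROPAGATE_LEVELS;
                 level++, pos = pos->parent)
                    this_cpu_add(pos->pcpu->state[idx], val);
    }

Whether the per-update cost is acceptable presumably depends on the
workload, which is why making it adaptive, as suggested above, sounds
right.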