Re: [RFC] memcg rstat flushing optimization

On Wed, Oct 5, 2022 at 11:22 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > > I was thinking more that being done inside the flush function.
> >
> > I think the flush function already does that in some sense if
> > might_sleep is true, right? The problem here is that we are using
>
> Oh I forgot about that. Right.
>
> ...
> > I took a couple of crashed machines kdumps and ran a script to
> > traverse updated memcgs and check how many cpus have updates and how
> > many updates are there on each cpu. I found that on average only a
> > couple of stats are updated per-cpu per-cgroup, and less than 25% of
> > cpus (but this is on a large machine, I expect the number to go higher
> > on smaller machines). Which is why I suggested a bitmask. I understand
> > though that this depends on whatever workloads were running on those
> > machines, and that in case where most stats are updated the bitmask
> > will actually make things slightly worse.
>
> One worry I have about selective flushing is that it's only gonna improve
> things by some multiples while we can reasonably increase the problem size
> by orders of magnitude.

I think in irqsafe contexts we would usually want to flush only a few
stats (< 5?) out of over 100, so I would say the improvement would be
good, but yeah, the problem size can reasonably increase by more than
that. It also depends on which stats we selectively flush: if they are
not in the same cache line we might end up bringing a lot of stats
into the cpu cache anyway.
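
For concreteness, the bitmask idea could look roughly like the below
(untested sketch; the updated_stats bitmap is a made-up field that
would need something like DECLARE_BITMAP(updated_stats, MEMCG_NR_STAT)
added to struct memcg_vmstats_percpu):

	/* update side: record which stat changed on this cpu */
	static inline void memcg_mark_stat_updated(struct mem_cgroup *memcg,
						   int idx)
	{
		struct memcg_vmstats_percpu *statc =
			this_cpu_ptr(memcg->vmstats_percpu);

		set_bit(idx, statc->updated_stats);
	}

	/* flush side: only fold the stats that actually changed */
	static void memcg_flush_cpu_stats(struct mem_cgroup *memcg, int cpu)
	{
		struct memcg_vmstats_percpu *statc =
			per_cpu_ptr(memcg->vmstats_percpu, cpu);
		int idx;

		for_each_set_bit(idx, statc->updated_stats, MEMCG_NR_STAT)
			/* fold statc->state[idx] into the cgroup totals */;

		bitmap_zero(statc->updated_stats, MEMCG_NR_STAT);
	}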

>
> The only real ways out I can think of are:
>
> * Implement a periodic flusher which keeps the stats needed in irqsafe path
>   acceptably uptodate to avoid flushing with irq disabled. We can make this
>   adaptive too - no reason to do all this if the number to flush isn't huge.

We do have a periodic flusher today for memcg stats (see
flush_memcg_stats_dwork), and on top of that mem_cgroup_flush_stats()
only flushes if the total number of updates is over a certain
threshold.
mem_cgroup_flush_stats_delayed(), which is called in the page fault
path, only does a flush if the last flush was long enough ago. We
don't use the delayed version in all irqsafe contexts though, and I am
not the right person to tell if we can.
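
If memory serves, the current logic is roughly the following
(simplified, quoting from memory rather than the exact source):

	void mem_cgroup_flush_stats(void)
	{
		/* skip the flush if not enough updates have accumulated */
		if (atomic_read(&stats_flush_threshold) > num_online_cpus())
			__mem_cgroup_flush_stats();
	}

	void mem_cgroup_flush_stats_delayed(void)
	{
		/* rate limited: only flush if the last flush is old enough */
		if (time_after64(jiffies_64, flush_next_time))
			mem_cgroup_flush_stats();
	}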

But I think this is not what you meant. I think you meant only
flushing the specific stats needed in irqsafe contexts more
frequently, and not invoking a flush at all in irqsafe contexts (or
using mem_cgroup_flush_stats_delayed()?). Right?

I am not the right person to judge what is acceptably up-to-date to be
honest, so I would wait for other memcg folks to chime in on this.

>
> * Shift some work to the updaters. e.g. in many cases, propagating per-cpu
>   updates a couple levels up from update path will significantly reduce the
>   fanouts and thus the number of entries which need to be flushed later. It
>   does add on-going overhead, so it prolly should adaptive or configurable,
>   hopefully the former.

If we are adding overhead to the updaters, would it be better to
maintain a bitmask of updated stats, or do you think it would be more
effective to propagate updates a couple of levels up? I think to
propagate updates up in the updater's context we would need percpu
versions of the "pending" stats, which would also add memory
consumption.

>
> Thanks.
>
> --
> tejun


