On Wed, Oct 5, 2022 at 11:38 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Wed, Oct 5, 2022 at 11:22 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
> >
> > Hello,
> >
> > On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > > > I was thinking more of that being done inside the flush function.
> > >
> > > I think the flush function already does that in some sense if
> > > might_sleep is true, right? The problem here is that we are using
> >
> > Oh I forgot about that. Right.
> >
> > ...
> > > I took kdumps from a couple of crashed machines and ran a script to
> > > traverse the updated memcgs and check how many cpus have updates and
> > > how many updates there are on each cpu. I found that on average only
> > > a couple of stats are updated per-cpu per-cgroup, and on less than
> > > 25% of cpus (but this is on a large machine; I expect the number to
> > > go higher on smaller machines), which is why I suggested a bitmask.
> > > I understand though that this depends on whatever workloads were
> > > running on those machines, and that in cases where most stats are
> > > updated the bitmask will actually make things slightly worse.
> >
> > One worry I have about selective flushing is that it's only gonna
> > improve things by some multiples, while we can reasonably increase
> > the problem size by orders of magnitude.
>
> I think we would usually want to flush a few stats (< 5?) in irqsafe
> contexts out of over 100, so I would say the improvement would be
> good, but yeah, the problem size can reasonably increase more than
> that. It also depends on which stats we selectively flush. If they
> are not in the same cache line we might end up bringing a lot of
> stats into the cpu cache anyway.
>
> > The only real ways out I can think of are:
> >
> > * Implement a periodic flusher which keeps the stats needed in the
> >   irqsafe path acceptably up to date, to avoid flushing with irqs
> >   disabled. We can make this adaptive too - no reason to do all this
> >   if the number to flush isn't huge.
>
> We do have a periodic flusher today for memcg stats (see
> flush_memcg_stats_dwork). It calls __mem_cgroup_flush_stats(), which
> only flushes if the total number of updates is over a certain
> threshold.
> mem_cgroup_flush_stats_delayed(), which is called in the page fault
> path, only does a flush if the last flush was a certain while ago. We
> don't use the delayed version in all irqsafe contexts, though, and I
> am not the right person to tell whether we can.
>
> But I think this is not what you meant. I think you meant only
> flushing the specific stats needed in irqsafe contexts more
> frequently, and not invoking a flush at all in irqsafe contexts (or
> using mem_cgroup_flush_stats_delayed()..?). Right?
>
> I am not the right person to judge what is acceptably up to date, to
> be honest, so I would wait for other memcg folks to chime in on this.
>
> > * Shift some work to the updaters. E.g. in many cases, propagating
> >   per-cpu updates a couple of levels up from the update path will
> >   significantly reduce the fanout and thus the number of entries
> >   which need to be flushed later. It does add ongoing overhead, so it
> >   should probably be adaptive or configurable - hopefully the former.
>
> If we are adding overhead to the updaters, would it be better to
> maintain a bitmask of updated stats, or do you think it would be more
> effective to propagate updates a couple of levels up? I think to
> propagate updates up in the updater's context we would need percpu
> versions of the "pending" stats, which would also add memory
> consumption.
> Any thoughts here, Tejun or anyone?
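To make the bitmask option a bit more concrete, here is a rough,
untested sketch of what per-cpu dirty tracking could look like. All
identifiers below (memcg_vmstats_percpu_sketch, sketch_mod_state,
MEMCG_NR_STAT as the item count, etc.) are placeholders for
illustration, not proposed kernel symbols; only the bitmap helpers are
existing APIs:

    #include <linux/bitmap.h>
    #include <linux/bitops.h>
    #include <linux/percpu.h>

    /* hypothetical per-cpu stat state with a dirty bitmap */
    struct memcg_vmstats_percpu_sketch {
            long            state[MEMCG_NR_STAT];
            /* one bit per stat item with unflushed updates on this cpu */
            DECLARE_BITMAP(dirty, MEMCG_NR_STAT);
    };

    /* update path: remember which item changed */
    static void sketch_mod_state(struct memcg_vmstats_percpu_sketch *pcpu,
                                 int idx, int val)
    {
            pcpu->state[idx] += val;
            set_bit(idx, pcpu->dirty);
    }

    /* flush path: walk only the items that actually changed */
    static void sketch_flush_cpu(struct memcg_vmstats_percpu_sketch *pcpu,
                                 long *pending)
    {
            int idx;

            for_each_set_bit(idx, pcpu->dirty, MEMCG_NR_STAT) {
                    pending[idx] += pcpu->state[idx];
                    pcpu->state[idx] = 0;
            }
            bitmap_zero(pcpu->dirty, MEMCG_NR_STAT);
    }

Since updates are cpu-local, the non-atomic __set_bit() may suffice
under the right context, so the per-update cost could be close to
free; the trade-off is one extra bitmap per cpu per cgroup, and, as
noted above, a workload that touches most stats would get no benefit.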
> >
> > Thanks.
> >
> > --
> > tejun
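For the "propagate a couple of levels up from the updater" option, my
reading is something like the sketch below (again untested; every
identifier here is an invented placeholder). The percpu "pending"
state at each ancestor is where the extra memory consumption mentioned
above would come from:

    #include <linux/percpu.h>

    #define SKETCH_NR_STAT          128
    #define SKETCH_PROPAGATE_LEVELS 2

    struct sketch_memcg_percpu {
            /* percpu "pending" state per cgroup - the extra memory cost */
            long    state[SKETCH_NR_STAT];
    };

    struct sketch_memcg {
            struct sketch_memcg            *parent;
            struct sketch_memcg_percpu     __percpu *pcpu;
    };

    static void sketch_mod_state_propagate(struct sketch_memcg *memcg,
                                           int idx, int val)
    {
            struct sketch_memcg *pos = memcg;
            int level;

            /*
             * Fold the delta into this cgroup and a couple of ancestors
             * at update time, trading per-update cost for fewer tree
             * levels the flusher has to walk later.
             */
            for (level = 0; pos && level <= SKETCH_PROPAGATE_LEVELS;
                 level++, pos = pos->parent)
                    this_cpu_add(pos->pcpu->state[idx], val);
    }

Whether the per-update cost is acceptable presumably depends on the
workload, which is why making it adaptive, as suggested above, sounds
right.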