Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Wed, 26 Jun 2024 23:35:07 +0200

On 26/06/2024 00.59, Yosry Ahmed wrote:
On Tue, Jun 25, 2024 at 3:35 PM Christoph Lameter (Ampere) <cl@xxxxxxxxx> wrote:

On Tue, 25 Jun 2024, Yosry Ahmed wrote:

In my reply above, I am not arguing to go back to the older
stats_flush_ongoing situation. Rather I am discussing what should be the
best eventual solution. From the vmstats infra, we can learn that with
frequent async flushes and no sync flush, users are fine with the
'non-determinism'. Of course cgroup stats differ from vmstats, i.e.
they are hierarchical, but I think we can try out this approach and see
whether it works.

If we do not do sync flushing, then the same problem that happened
with stats_flush_ongoing could occur again, right? Userspace could
read the stats after an event, and get a snapshot of the system before
that event.

Perhaps this is fine for vmstats if it has always been like that (I
have no idea), or if no users make assumptions about this. But for
cgroup stats, we have use cases that rely on this behavior.

vmstat updates are triggered initially as needed by the shepherd task and
there is no requirement that this is triggered simultaneously. We
could actually randomize the intervals in vmstat_update() a bit if this
will help.

The problem is that for cgroup stats, the behavior has been that a
userspace read will trigger a flush (i.e. propagating updates). We
have use cases that depend on this. If we switch to the vmstat model
where updates are triggered independently from user reads, it
constitutes a behavioral change.
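
To make the difference between the two read models concrete, here is a
minimal userspace sketch. It is a toy model, not kernel code: the
per-CPU array, NR_CPUS value, and every function name below are
hypothetical and exist only for illustration.

```c
#include <assert.h>

#define NR_CPUS 4	/* hypothetical CPU count for the toy model */

/* Per-CPU deltas that have not yet been propagated (hypothetical). */
static long pending[NR_CPUS];
/* Aggregated value that readers observe. */
static long total;

/* Record an event on one CPU; only the local delta changes. */
static void stat_add(int cpu, long delta)
{
	pending[cpu] += delta;
}

/* Propagate every pending per-CPU delta into the aggregate. */
static void flush(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		total += pending[cpu];
		pending[cpu] = 0;
	}
}

/* Read-triggered model: the reader flushes before looking. */
static long read_sync(void)
{
	flush();
	return total;
}

/* Periodic model: the reader sees whatever the last flush left behind. */
static long read_async(void)
{
	return total;
}

/* Walks through the scenario: events happen, then userspace reads. */
void demo(void)
{
	stat_add(0, 5);
	stat_add(2, 7);
	assert(read_async() == 0);	/* snapshot from before the events */
	assert(read_sync() == 12);	/* flush on read sees both events */
	assert(read_async() == 12);	/* up to date only after a flush ran */
}
```

In this model a reader using read_async() right after an event gets the
pre-event snapshot until some other party runs flush(), which is the
behavioral change described above.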

I implemented a variant using completions as Yosry asked for:

https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
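
For readers unfamiliar with the pattern, the general idea of letting one
caller flush while the others wait for it to finish can be sketched in
userspace as below. This is not the code from the linked patch; it is a
condition-variable analogue under my own assumptions, and all names
(flush_or_wait, sample_flush, generation, flushes_run) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_done = PTHREAD_COND_INITIALIZER;
static bool flushing;			/* true while one caller is flushing */
static unsigned long generation;	/* bumped each time a flush finishes */
static int flushes_run;			/* number of flushes actually executed */
static int work_done;			/* touched by the sample flush below */

/* Stand-in for the expensive flush work (hypothetical). */
static void sample_flush(void)
{
	work_done++;
}

/*
 * Returns true if the caller ran the flush itself, false if it waited
 * for a flush that another caller already had in progress.
 */
bool flush_or_wait(void (*do_flush)(void))
{
	pthread_mutex_lock(&lock);
	if (flushing) {
		/* Sleep until the ongoing flush signals that it finished. */
		unsigned long seen = generation;

		while (generation == seen)
			pthread_cond_wait(&flush_done, &lock);
		pthread_mutex_unlock(&lock);
		return false;
	}
	flushing = true;
	pthread_mutex_unlock(&lock);

	do_flush();	/* expensive part runs without the lock held */

	pthread_mutex_lock(&lock);
	flushing = false;
	generation++;
	flushes_run++;
	pthread_cond_broadcast(&flush_done);	/* wake every waiter */
	pthread_mutex_unlock(&lock);
	return true;
}

/* Single-threaded walk-through: with no flush ongoing, the caller flushes. */
void demo(void)
{
	assert(flush_or_wait(sample_flush));
	assert(flushes_run == 1);
}
```

Note that in this sketch a waiter joins a flush that was already under
way, so an update made after that flush started is not guaranteed to be
included in what the waiter then reads.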

--Jesper