On 30/07/2024 20.54, Yosry Ahmed wrote:
[..]
Well... I'm still not convinced that it makes sense to have level >= 2
be the ongoing flusher.
E.g. if a level 2 cgroup becomes ongoing flusher, and kswapd starts 12
NUMA flushes at the same time, then the code will have these 12 kswapd
threads spin on the lock, until ongoing flusher finishes. That is likely
what happened above (for a level 1). These 12 spinning (root) flushers
will not recheck ongoing_flusher and will all flush the root
(unnecessarily 11 times).
Hmm regardless of whether or not the level-2 cgroup becomes the
ongoing flusher, the kswapd threads will all spin on the lock anyway
since none of them can be the ongoing flusher until the level-2 cgroup
finishes. Right?
Is the scenario you have in mind that the level-2 cgroup starts
flushing at the same time as kswapd, so there is a race on who gets to
be the ongoing flusher? In this case as well, whoever gets the lock
will be the ongoing flusher anyway.
Not allowing whoever is holding the lock to be the ongoing flusher
based on level is only useful when we can have multiple ongoing
flushers (with lock yielding). Right?
Perhaps I am missing something here.
So, I don't think it is a good idea to have anything else that the root
as the ongoing flusher.
Can you explain/convince me why having sub-cgroups as ongoing flusher is
an advantage?
I just don't see the benefit of the special casing here as I mentioned
above. If I missed something please let me know.
I do think you missed something. Let me try to explain this in another
way. (I hope my frustrations doesn't shine through).
The main purpose of the patch is/was to stop the thundering herd of
kswapd thread flushing (root-cgrp) at exactly the same time, leading to
lock contention. This happens all-the-time/constantly in production.
The first versions (where ongoing was limited to root/level=0) solved
this 100%. The patches that generalized this to be all levels can
become ongoing flush, doesn't solve the problem any-longer!
I hope it is clear what fails. E.g. When a level>0 becomes ongoing
flusher, and 12 kswapd simultaneously does a level=0/root-cgrp flush,
then we have 12 CPU cores spinning on the rstat lock. (These 12 kswapd
threads will all go-through completing the flush, as they do not
discover/recheck that ongoing flush was previously became their own level).
I think we need to go back to only having root-cgroup as ongoing flusher.
--Jesper