(I'll leave the dirty throttling discussion to Johannes, because I'm not so
familiar with that code or its history.)
Michal Hocko writes:
> The main problem I see with that approach is that the loop could easily
> lead to reclaim unfairness when a heavy producer which doesn't leave the
> kernel (e.g. a large read/write call) can keep a different task doing
> all the reclaim work. The loop is effectively unbounded when there is a
> reclaim progress and so the return to the userspace is by no means
> proportional to the requested memory/charge.
It's not unbounded when there is reclaim progress: it stops as soon as we are
within the memory.high throttling grace period. Right after reclaim, we check
whether penalty_jiffies is less than 10ms, and if so we abort any further
reclaim or allocator throttling (sketched below from
mem_cgroup_handle_over_high() in mm/memcontrol.c):
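	/*
	 * Don't sleep if the amount of jiffies this memcg owes us is so low
	 * that it's not even worth doing, in an attempt to be nice to those
	 * who go only a small amount over their memory.high value and maybe
	 * haven't been aggressively reclaimed enough yet.
	 */
	if (penalty_jiffies <= HZ / 100)
		goto out;

(HZ / 100 is 10ms expressed in jiffies.)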
> Just imagine that you have parallel producers increasing the high limit
> excess while somebody reclaims those. Sure in practice the loop will be
> bounded but the reclaimer might perform much more work on behalf of
> other tasks.
A cgroup is a unit and breaking it down into "reclaim fairness" for individual
tasks like this seems suspect to me. For example, if one task in a cgroup is
leaking unreclaimable memory like crazy, everyone in that cgroup is going to be
penalised by allocator throttling as a result, even if they aren't
"responsible" for that reclaim.
So when a cgroup is over memory.high and a single reclaim attempt isn't
enough, the options are as follows:
1. Decline further reclaim. Instead, throttle for up to 2 seconds.
2. Keep on reclaiming. Only throttle if we can't get back under memory.high
   (see the sketch below).
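Concretely, #2 looks something like this. This is only a simplified sketch of
mem_cgroup_handle_over_high() with the proposed behaviour; reclaim_high(),
calculate_high_delay() and the retry bound are abridged from mm/memcontrol.c,
and the details vary between kernel versions:

	bool in_retry = false;
	int nr_retries = MAX_RECLAIM_RETRIES;	/* the loop stays bounded */

retry_reclaim:
	/* Try to get back under memory.high before considering a sleep. */
	nr_reclaimed = reclaim_high(memcg,
				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
				    GFP_KERNEL);

	penalty_jiffies = calculate_high_delay(memcg, nr_pages);

	/* Within the throttling grace period: no sleep, no more reclaim. */
	if (penalty_jiffies <= HZ / 100)
		goto out;

	/* Still over memory.high, but making progress: reclaim again. */
	if (nr_reclaimed || nr_retries--) {
		in_retry = true;
		goto retry_reclaim;
	}

	/* Reclaim can't keep up; throttle as the last resort. */
	schedule_timeout_killable(penalty_jiffies);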
Your suggestion to decline further reclaim leads to case #1, which in
practice is significantly more "unfair" to that task. Throttling is extremely
disruptive to tasks and should be a last resort once we've exhausted all
other practical options. It shouldn't be something a task gets hit with just
because the kernel didn't try hard enough to reclaim first.
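For reference, the "up to 2 seconds" in #1 is the per-return clamp applied
just before the sleep. Sketched from mm/memcontrol.c, where
MEMCG_MAX_HIGH_DELAY_JIFFIES is 2 * HZ (again, details may vary by version):

	/* Clamp each usermode-return sleep so the task still moves forward. */
	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);

	psi_memstall_enter(&pflags);
	schedule_timeout_killable(penalty_jiffies);
	psi_memstall_leave(&pflags);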