Michal Hocko writes:
A cgroup is a unit and breaking it down into "reclaim fairness" for
individual tasks like this seems suspect to me. For example, if one task in
a cgroup is leaking unreclaimable memory like crazy, everyone in that cgroup
is going to be penalised by allocator throttling as a result, even if they
aren't "responsible" for that reclaim.
You are right, but that doesn't mean it is desirable for some tasks to be
throttled for an unexpectedly long time because of another task's activity.
Are you really talking about throttling, or reclaim? If throttling, tasks are
already throttled in calculate_high_delay in proportion to how much their
allocation contributes to the overage.
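To make the proportionality concrete, here is a rough userspace model of that
kind of penalty calculation. It is only a sketch: the function names, the HZ
value, the batch size of 32 and the shift constants are assumptions chosen for
illustration, not a copy of the kernel code, and it makes no attempt to guard
against overflow for extreme overages.

```c
#include <stdint.h>

/* All constants below are illustrative assumptions, not kernel values. */
#define MODEL_HZ              1000ULL
#define MODEL_CHARGE_BATCH    32ULL
#define MODEL_PRECISION_SHIFT 20
#define MODEL_SCALING_SHIFT   14
#define MODEL_MAX_DELAY       (2ULL * MODEL_HZ) /* cap at roughly 2 seconds */

/* Fixed-point fraction by which usage exceeds the high limit. */
static uint64_t model_overage(uint64_t usage, uint64_t high)
{
    if (usage <= high)
        return 0;
    if (high == 0)
        high = 1; /* avoid dividing by zero */
    return ((usage - high) << MODEL_PRECISION_SHIFT) / high;
}

/*
 * Delay in jiffies for a charge of nr_pages: grows with the square of the
 * overage, then scales linearly with the size of this particular charge.
 */
static uint64_t model_high_delay(uint64_t usage, uint64_t high,
                                 uint64_t nr_pages)
{
    uint64_t overage = model_overage(usage, high);
    uint64_t penalty = overage * overage * MODEL_HZ;

    penalty >>= MODEL_PRECISION_SHIFT;
    penalty >>= MODEL_SCALING_SHIFT;
    penalty = penalty * nr_pages / MODEL_CHARGE_BATCH;

    return penalty > MODEL_MAX_DELAY ? MODEL_MAX_DELAY : penalty;
}
```

In this model, a task charging twice as many pages at the same overage gets
twice the delay (until the cap), which is the sense in which the throttling is
already proportional to the allocating task's own contribution.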
If you're talking about reclaim, trying to reason about whether the overage is
the result of some other task in this cgroup or the task that's allocating
right now is something that we already know doesn't work well (e.g. global OOM).
So the options here are as follows when a cgroup is over memory.high and a
single reclaim isn't enough:
1. Decline further reclaim. Instead, throttle for up to 2 seconds.
2. Keep on reclaiming. Only throttle if we can't get back under memory.high.
The outcome of your suggestion to decline further reclaim is case #1, which
is significantly more practically "unfair" to that task. Throttling is
extremely disruptive to tasks and should be a last resort when we've
exhausted all other practical options. It shouldn't be something you get
just because you didn't try to reclaim hard enough.
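The difference between the two options can be sketched with a toy model. This
is not kernel code: struct cgroup_model, model_reclaim() and the retry budget
are hypothetical names invented for illustration, and "reclaimable" is assumed
to be a subset of "usage".

```c
#include <stdbool.h>

/* Hypothetical toy model of a cgroup; all quantities are in pages. */
struct cgroup_model {
    unsigned long usage;
    unsigned long high;
    unsigned long reclaimable; /* assumed to be <= usage */
};

/* Reclaim up to nr_pages and return how many pages were actually freed. */
static unsigned long model_reclaim(struct cgroup_model *cg,
                                   unsigned long nr_pages)
{
    unsigned long got = nr_pages < cg->reclaimable ? nr_pages
                                                   : cg->reclaimable;

    cg->reclaimable -= got;
    cg->usage -= got;
    return got;
}

/* Option 1: one reclaim attempt; returns true if the task gets throttled. */
static bool option1_single_reclaim(struct cgroup_model *cg,
                                   unsigned long nr_pages)
{
    model_reclaim(cg, nr_pages);
    return cg->usage > cg->high;
}

/*
 * Option 2: keep reclaiming while over the limit. Attempts that free
 * nothing consume a bounded retry budget; throttle only if the cgroup
 * is still over the limit once that budget is exhausted.
 */
static bool option2_keep_reclaiming(struct cgroup_model *cg,
                                    unsigned long nr_pages, int max_retries)
{
    int retries = max_retries;

    while (cg->usage > cg->high) {
        unsigned long got = model_reclaim(cg, nr_pages);

        if (got == 0 && retries-- <= 0)
            break;
    }
    return cg->usage > cg->high;
}
```

With plenty of reclaimable memory and a small per-attempt target, option 1
ends in throttling while option 2 gets back under the limit. Both throttle
when nothing is reclaimable at all.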
I believe I have asked this in another email in this thread. Could you explain
why enforcing the requested target (memcg_nr_pages_over_high) is
insufficient for the problem you are dealing with? Because that would
make sense for large targets to me while it would keep relatively
reasonable semantics for the throttling - aka proportional to the memory
demand rather than the excess.
memcg_nr_pages_over_high is related to the charge size. As such, if you're way
over memory.high as a result of transient reclaim failures, but the majority of
your charges are small, it's going to be hard to make meaningful progress:
1. Most nr_pages will be MEMCG_CHARGE_BATCH, which is not enough to help;
2. Large allocations will only get a single reclaim attempt to succeed.
As such, in many cases we're either doomed to successfully reclaim a paltry
number of pages, or to fail to reclaim a lot of pages. Asking try_to_free_pages()
to deal with those huge allocations is generally not reasonable, regardless of
the specifics of why it doesn't work in this case.
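As a back-of-the-envelope illustration of point 1, the helper below counts how
many reclaim attempts of a given size it takes to cover an overage, assuming
every attempt frees its full target. The helper name and the batch size of 32
pages are assumptions made for this sketch.

```c
/* Assumed batch size in pages, for illustration only. */
#define MODEL_BATCH_PAGES 32UL

/*
 * Number of reclaim attempts of nr_pages each needed to bring usage back
 * down to high, in the best case where each attempt frees all nr_pages.
 */
static unsigned long attempts_needed(unsigned long usage, unsigned long high,
                                     unsigned long nr_pages)
{
    unsigned long overage;

    if (usage <= high || nr_pages == 0)
        return 0;

    overage = usage - high;
    return (overage + nr_pages - 1) / nr_pages; /* round up */
}
```

For example, an overage of 2560 pages takes 80 batch-sized attempts in this
model even when every one of them succeeds in full, whereas a single charge of
2560 pages would have to recover the whole overage in its one attempt.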