TL;DR: I see merit in the recursive propagation if it's requested explicitly
(i.e. retaining the meaning of 0). The protection/weight semantics should be
refined.

On Wed, Feb 26, 2020 at 10:05:48AM -0500, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> They still ultimately translate to real resources. The concrete value
> depends on what the parent's weight translates to, and it depends on
> sibling configurations and their current consumption. (All of this is
> already true for memory protection as well, btw). But eventually, a
> weight specification translates to actual time on a CPU, bandwidth on
> an IO device etc.
>
> > - sum of sibling weights is meaningless (and independent from parent
> > weight)
>
> Technically true for overcommitted memory.low values as well.

Yes, but only in the overcommitted case. For pure weights it doesn't matter
whether you set 1:10, 10:100 or 100:1000; for the protection, that
equivalence only holds when approaching infinity, and because the sum is
compared against the parent's value, the protection behaves differently.

[If there had to be some pure memory weights, they would for instance
express the relative affinity of a group's pages to physical memory.]

> I don't see a fundamental difference between them. And that in turn
> makes it hard for me to accept that hierarchical inheritance rules
> should be different.

I'll try to come up with some better examples of the difference I perceive.

> "Wrong" isn't the right term. Is it what you wanted to express in your
> configuration?

I want to express an absolute amount of memory (ideally representing the
working set size) under protection. IIUC, you want to express general
relative priorities of B vs C when some outer metric has to be maintained
and you reach both the memory and IO limits.

> You are talking about a mathematical truth on a per-controller
> basis. What I'm saying is that I don't see how this is useful for real
> workloads, their relative priorities, and the performance expectations
> users have from these priorities.
> With a priority inversion like this, there is no actual performance
> isolation or containerization going on here - which is the whole point
> of cgroups and resource control.

I acknowledge that by pressing too much along one dimension (memory) you
induce expansion in the other dimension (IO), and that may become
noticeable in siblings (expansion over saturation [1]). But that's expected
when only weights are in use. If you wanted to hide the effect of workload
B on C, B would need a real limit.

[I beg to disagree that containerization is the whole point of cgroups;
it's a large part of it, hence the isolation needn't necessarily be
bi-directional.]

> My objection is to opting out of protection against cousins (thus
> overriding parental resource assignment), not against siblings.

Just to sync up the terminology - I'm calling this protection against
uncles (the composition/structure under them is irrelevant), and the
limitation comes from the grandparent or higher (or global). ...and the
overridden parental resource assignment is the expansion in the non-memory
dimension (IO/CPU).

> Correct, but you can change the tree to this:
>
>   A.low=10G
>   `- A1.low=10G
>      `- B.low=0G
>      `- C.low=0G
>   `- D.low=0G
>
> to express
>
>   A1 > D
>   B = C

That sort of works (if I give up the scapegoat), although I don't like
having to copy the value from A to A1; I could have done that with the
previous hierarchy and simply set B.low=C.low=10G.
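Returning to the overcommit point above for a moment, here is a toy model
(Python, with made-up names such as effective_low; it is not the actual
mm/memcontrol.c logic, which also clamps by usage and deals with unused
protection) of why the absolute sum of sibling memory.low values matters,
while for weights only the ratio does:

def effective_low(parent_eff, children_low):
    """Toy scale-down of children's memory.low when their sum
    overcommits the parent's effective protection."""
    total = sum(children_low.values())
    if total <= parent_eff:
        # not overcommitted: configured values are honored as-is
        return dict(children_low)
    # overcommitted: each child gets only a proportional share of the parent
    return {name: parent_eff * low / total
            for name, low in children_low.items()}

G = 1 << 30
M = 1 << 20
# Same 1:10 ratio, different absolute values, parent protects 10G:
print(effective_low(10 * G, {"B": 1 * G, "C": 10 * G}))   # scaled down
print(effective_low(10 * G, {"B": 1 * M, "C": 10 * M}))   # honored as-is

With weights the two configurations would behave identically; with
protection, only the first one is scaled because its sum exceeds the
parent's 10G.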
> That is, I would like to see an argument for this setup:
>
>   A
>   `- B  io.weight=200  memory.low=10G
>      `- D  io.weight=100 (e.g.)  memory.low=10G
>      `- E  io.weight=100 (e.g.)  memory.low=0
>   `- C  io.weight=50  memory.low=5G
>
> Where E has no memory protection against C, but E has IO priority over
> C. That's the configuration that cannot be expressed with a recursive
> memory.low, but since it involves priority inversions it's not useful
> to actually isolate and containerize workloads.

But there can be no cousin (uncle) at all; more precisely, it's the global
rest that we don't mind affecting.

> > I'd say that protected memory is a disposable resource in contrast with
> > CPU/IO. If you don't have latter, you don't progress; if you lack the
> > former, you are refaulting but can make progress. Even more, you should
> > be able to give up memory.min.
>
> Eh, I'm not buying that. You cannot run without memory either. If
> somebody reclaims a page between you faulting it in and you resuming
> to userspace, there is no forward progress.

I made a hasty argument (misinterpreting the constant outer reclaim
pressure), so that wasn't the fundamental difference. As for the second
part: memory.min is subject to the same calculation as memory.low. Do you
find the scapegoat preventing OOM in the grandparent (or higher) subtree
also a misfeature/artifact?

Thanks,
Michal

[1] Here I'm taking your/Tejun's assumption that in stressful situations it
always boils down to IO, although I don't have any quantitative arguments
for that.
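P.S. To make the recursive-propagation part concrete, a toy model of the
idea (made-up helper name recursive_low; only a sketch of distributing a
parent's unclaimed protection to children in proportion to their usage, not
the patch's actual reclaim-path code):

def recursive_low(parent_eff, children):
    """children: {name: (low, usage)} -> {name: effective protection}.
    A child keeps min(low, usage) and in addition inherits a share of
    the parent's unclaimed protection proportional to its usage."""
    claimed = sum(min(low, use) for low, use in children.values())
    unclaimed = max(parent_eff - claimed, 0)
    total_usage = sum(use for _, use in children.values()) or 1
    return {name: min(low, use) + unclaimed * use / total_usage
            for name, (low, use) in children.items()}

G = 1 << 30
# B.low=10G, D and E each using 5G, E explicitly set to 0:
print(recursive_low(10 * G, {"D": (10 * G, 5 * G), "E": (0, 5 * G)}))

Under such a rule E still inherits part of B's protection (its 0 no longer
opts it out), which is why the TL;DR asks for the recursive behavior to be
an explicit opt-in.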