Hey Michal,

On Wed, Dec 14, 2022 at 04:29:06PM +0100, Michal Hocko wrote:
> On Wed 14-12-22 13:40:33, Johannes Weiner wrote:
> > The only way to prevent cgroups from disrupting each other on NUMA
> > nodes is NUMA constraints. Cgroup per-node limits. That shields not
> > only from demotion, but also from DoS-mbinding, or aggressive
> > promotion. All of these can result in some form of premature
> > reclaim/demotion, proactive demotion isn't special in that way.
>
> Any numa based balancing is a real challenge with memcg semantic. I do
> not see per numa node memcg limits without a major overhaul of how we do
> charging though. I am not sure this is on the table even long term.
> Unless I am really missing something here we have to live with the
> existing semantic for a foreseeable future.

Yes, I think you're quite right.

We've been mostly skirting the NUMA issue in cgroups (and to a degree
in MM code in general) with two possible answers:

a) The NUMA distances are close enough that we ignore it and pretend
   all memory is (mostly) fungible.

b) The NUMA distances are big enough that it matters, in which case
   the best option is to avoid sharing, and use bindings to keep
   workloads/containers isolated to their own CPU+memory domains.

Tiered memory forces the issue by providing memory that must be
shared between workloads/containers, but is not fungible. At least
not without incurring priority inversions between containers, where a
lopri container promotes itself to the top and demotes the hipri
workload, while staying happily within its global memory allowance.
This applies to mbind() cases as much as it does to NUMA balancing.

If these setups proliferate, it seems inevitable to me that sooner or
later the full problem space of memory cgroups - dividing up a shared
resource while allowing overcommit - applies not just to "RAM as a
whole", but to each memory tier individually.

Whether we need the full memcg interface per tier or per node, I'm
not sure. It might be enough to automatically apportion global
allowances to nodes; so if you have 32G toptier and 16G lowtier, and
a cgroup has a 20G allowance, it gets 13G on top and 7G on low.

(That, or we settle on multi-socket systems with private tiers, such
that memory continues to be unshared :-)

Either way, I expect this issue will keep coming up as we try to use
containers on such systems.
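
Just to make the apportioning math above concrete: a rough userspace
sketch of splitting a cgroup's global allowance across tiers in
proportion to tier capacity. All names and numbers here are made up
for illustration; this is not kernel code or an existing memcg
interface.

/*
 * Hypothetical illustration only: divide a cgroup's global allowance
 * across memory tiers proportionally to each tier's capacity.
 */
#include <stdio.h>

struct tier {
	const char *name;
	unsigned long long capacity;	/* bytes of memory in this tier */
};

int main(void)
{
	/* Example machine from above: 32G toptier + 16G lowtier. */
	struct tier tiers[] = {
		{ "toptier", 32ULL << 30 },
		{ "lowtier", 16ULL << 30 },
	};
	unsigned long long allowance = 20ULL << 30;	/* cgroup limit: 20G */
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < sizeof(tiers) / sizeof(tiers[0]); i++)
		total += tiers[i].capacity;

	/*
	 * Each tier gets a share of the allowance proportional to its
	 * share of total memory:
	 *   20G * 32/48 ~= 13G on toptier, 20G * 16/48 ~= 7G on lowtier.
	 */
	for (i = 0; i < sizeof(tiers) / sizeof(tiers[0]); i++) {
		double share = (double)allowance * tiers[i].capacity / total;

		printf("%s: ~%.0fG of the 20G allowance\n",
		       tiers[i].name, share / (double)(1ULL << 30));
	}

	return 0;
}

The same proportional split would presumably have to be recomputed
whenever the global allowance or the tier sizes change, which is part
of why per-node/per-tier charging is the harder question.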