On Tue, Oct 03, 2023 at 02:58:58PM +0200, Michal Hocko wrote:
> On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > For instance, here is one of our usecases: suppose there are two 32G
> > containers. The machine is booted with hugetlb_cma=6G, and each
> > container may or may not use up to 3 gigantic pages, depending on the
> > workload within it. The rest is anon, cache, slab, etc. We can set the
> > hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> > But it is very difficult to configure memory.max to keep overall
> > consumption, including anon, cache, slab etc. fair.
> >
> > What we have had to resort to is to constantly poll hugetlb usage and
> > readjust memory.max. Similar procedure is done to other memory limits
> > (memory.low for e.g). However, this is rather cumbersome and buggy.
>
> Could you expand some more on how this _helps_ memory.low? The
> hugetlb memory is not reclaimable so whatever portion of its memcg
> consumption will be "protected from the reclaim". Consider this
>
>                 parent
>                /      \
>               A        B
>         low=50%        low=0
>     current=40%        current=60%
>
> We have an external memory pressure and the reclaim should prefer B as A
> is under its low limit, correct? But now consider that the predominant
> consumption of B is hugetlb which would mean the memory reclaim cannot
> do much for B and so the A's protection might be breached.
>
> As an admin (or a tool) you need to know about hugetlb as a potential
> contributor to this behavior (sure mlocked memory would behave the same
> but mlock rarely consumes huge amount of memory in my experience).
> Without the accounting there might not be any external pressure in the
> first place.
>
> All that being said, I do not see how adding hugetlb into accounting
> makes low, min limits management any easier.

It's important to differentiate the cgroup usecases. One is of course
the cloud/virtual server scenario, where you set the hard limits to
whatever the customer paid for, and don't know and don't care about
the workload running inside. In that case, memory.low and overcommit
aren't really safe to begin with due to unknown unreclaimable mem.

The other common usecase is the datacenter where you run your own
applications. You understand their workingset and requirements, and
configure and overcommit the containers in a way where jobs always
meet their SLAs. E.g. if multiple containers spike, memory.low is set
such that interactive workloads are prioritized over batch jobs, and
both have priority over routine system management tasks.

This is arguably the only case where it's safe to use memory.low. You
have to know what's reclaimable and what isn't, otherwise you cannot
know that memory.low will even do anything, and isolation breaks down.
So we already have that knowledge: mlocked sections, how much anon is
without swap space, and how much memory must not be reclaimed (even if
it is reclaimable) for the workload to meet its SLAs. Hugetlb doesn't
really complicate this equation - we already have to consider it
unreclaimable workingset from an overcommit POV on those hosts.

The reason this patch helps in this scenario is that the service teams
are usually different from the containers/infra team. The service
understands their workload and declares its workingset. But it's the
infra team running the containers that currently has to go and find
out if they're using hugetlb and tweak the cgroups. Bugs and
untimeliness in the tweaking have caused multiple production incidents
already. And both teams are regularly confused when there are large
parts of the workload that don't show up in memory.current which both
sides monitor.
Keep in mind that these systems are already pretty complex, with
multiple overcommitted containers and system-level activity. The
current hugetlb quirk can heavily distort what a given container is
doing on the host.

With this patch, the service can declare its workingset, the container
team can configure the container, and memory.current makes sense to
everybody. The workload parameters are pretty reliable, but if the
service team gets it wrong and we underprotect the workload, and/or
its unreclaimable memory exceeds what was declared, the infra team
gets alarms on elevated LOW breaching events and investigates if it's
an infra problem or a service spec problem that needs escalation.

So the case you describe above only happens when mistakes are made,
and we detect and rectify them. In the common case, hugetlb is part of
the recognized workingset, and we configure memory.low to cut off only
known optional and reclaimable memory under pressure.
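
To make the A/B example above concrete, here is a rough toy model of
it in Python. This is only a sketch, not how the kernel implements
reclaim: the Cgroup class, the reclaim() helper, and the two-pass
policy (take unprotected reclaimable memory first, breach memory.low
only if that is not enough) are simplifying assumptions made up for
illustration. Numbers are percent of the parent, as in the diagram,
and the 55/5 split inside B is invented.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    # Hypothetical model of a cgroup, not a kernel structure.
    name: str
    low: int            # memory.low protection
    current: int        # charged usage, hugetlb included (as with the patch)
    unreclaimable: int  # part of `current` that reclaim cannot free


def reclaim(groups, target):
    """Toy two-pass reclaim over sibling cgroups.

    Pass 1 takes only reclaimable memory above each group's low
    protection. Pass 2 runs only if the target is still unmet and
    takes reclaimable memory regardless of protection, recording
    which groups had their protection breached.
    """
    taken = {g.name: 0 for g in groups}
    breached = []
    remaining = target

    # Pass 1: respect memory.low.
    for g in groups:
        reclaimable = g.current - g.unreclaimable
        above_low = max(0, g.current - g.low)
        take = min(remaining, reclaimable, above_low)
        taken[g.name] = take
        remaining -= take

    # Pass 2: protection is best-effort, so breach it if needed.
    for g in groups:
        if remaining == 0:
            break
        left = g.current - g.unreclaimable - taken[g.name]
        take = min(remaining, left)
        if take > 0:
            taken[g.name] += take
            remaining -= take
            breached.append(g.name)

    return taken, breached


# B is mostly hugetlb (assumed 55 of its 60): A's protection gets breached.
mostly_hugetlb = reclaim(
    [Cgroup("A", low=50, current=40, unreclaimable=0),
     Cgroup("B", low=0, current=60, unreclaimable=55)],
    target=20,
)

# B is fully reclaimable: reclaim is satisfied from B alone.
all_reclaimable = reclaim(
    [Cgroup("A", low=50, current=40, unreclaimable=0),
     Cgroup("B", low=0, current=60, unreclaimable=0)],
    target=20,
)
```

In the first run the model reports A as breached, which is the kind of
LOW event the infra team alarms on. In the second run A is left alone.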