On Mon, Oct 02, 2023 at 05:18:27PM -0700, Nhat Pham wrote:
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> For instance, here is one of our usecases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic pages, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab etc. fair.
>
> What we have had to resort to is to constantly poll hugetlb usage and
> readjust memory.max. A similar procedure is applied to other memory
> limits (e.g. memory.low). However, this is rather cumbersome and buggy.
> Furthermore, when there is a delay in correcting the memory limits
> (e.g. when hugetlb usage changes between consecutive runs of the
> userspace agent), the system could be in an over/underprotected state.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is utilized, and uncharging when the folio is freed (analogous to
> the hugetlb controller). Note that we do not charge when the folio is
> allocated to the hugetlb pool, because at this point it is not owned by
> any memcg.
>
> Some caveats to consider:
>   * This feature is only available on cgroup v2.
>   * There is no hugetlb pool management involved in the memory
>     controller. As stated above, hugetlb folios are only charged towards
>     the memory controller when they are used. Host overcommit management
>     has to consider this when configuring hard limits.
>   * Failure to charge towards the memcg results in SIGBUS. This could
>     happen even if the hugetlb pool still has pages (but the cgroup
>     limit is hit and the reclaim attempt fails).
>   * When this feature is enabled, hugetlb pages contribute to memory
>     reclaim protection. Tuning of the low and min limits must take
>     hugetlb memory into account.
>   * Hugetlb pages utilized while this option is not selected will not
>     be tracked by the memory controller (even if cgroup v2 is remounted
>     later on).
>
> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>

Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
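
For readers following along, the charge-on-use semantics described in the
changelog can be illustrated with a toy userspace model. This is only a
sketch, not the kernel implementation: ToyMemcg, ToyHugetlbPool, and
fault_in are hypothetical names, reclaim is not modeled at all, and the
string "SIGBUS" stands in for the signal the faulting task would receive.

```python
GIB = 1 << 30


class ToyMemcg:
    """Toy stand-in for a cgroup v2 memory controller (hypothetical, not kernel code)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # plays the role of memory.max
        self.usage = 0

    def try_charge(self, nbytes):
        # Reclaim is not modeled: a charge that would exceed the limit just fails.
        if self.usage + nbytes > self.max_bytes:
            return False
        self.usage += nbytes
        return True

    def uncharge(self, nbytes):
        self.usage -= nbytes


class ToyHugetlbPool:
    """Toy hugetlb pool. Pages sitting in the pool are not charged to any memcg."""

    def __init__(self, free_pages, page_size=GIB):
        self.free_pages = free_pages
        self.page_size = page_size

    def fault_in(self, memcg):
        # Charge happens when the page is put to use, not when the pool is filled.
        if self.free_pages == 0:
            return "SIGBUS"
        if not memcg.try_charge(self.page_size):
            # The pool still has pages, but the cgroup limit was hit.
            return "SIGBUS"
        self.free_pages -= 1
        return "ok"

    def free(self, memcg):
        # Uncharge when the page goes back to the pool.
        memcg.uncharge(self.page_size)
        self.free_pages += 1


# Pool of 6 gigantic pages (as with hugetlb_cma=6G), cgroup limited to 2G.
pool = ToyHugetlbPool(free_pages=6)
memcg = ToyMemcg(max_bytes=2 * GIB)
results = [pool.fault_in(memcg) for _ in range(3)]
```

In this model the third fault fails although the pool still holds free
pages, which is the SIGBUS caveat from the list above. Filling the pool
charges nothing, matching the note that folios in the pool are not owned
by any memcg.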