On Wed 27-09-23 17:57:22, Nhat Pham wrote:
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is allocated, and uncharging when the folio is freed (analogous to
> the hugetlb controller).

This changelog is missing a lot of information, both about the usecase
(we do not want to fish that out from the archives in the future) and
about the actual implementation and the reasoning behind it.

AFAICS you have decided to charge on hugetlb use rather than on hugetlb
allocation to the pool. I suspect the underlying reasoning is that pool
pages do not belong to anybody. This is a deliberate decision and it
should be documented as such.

It is also very important to describe subtle behavior properties that
might be rather unintuitive to users. Most notably:

- there is no hugetlb pool management involved in the memcg
  controller. One has to use the hugetlb controller for that purpose.
  Also, the pre-allocated pool as such doesn't belong to anybody, so
  the memcg host overcommit management has to consider it when
  configuring hard limits.
- memcg limit reclaim doesn't assist hugetlb page allocation when
  hugetlb overcommit is configured (i.e. pages are not consumed from
  the pool), which means that the page allocation might disrupt
  workloads from other memcgs.
- failure to charge a hugetlb page results in SIGBUS rather than the
  memcg oom killer. That could be the case even if the hugetlb pool
  still has pages available and there is reclaimable memory in the
  memcg.
- hugetlb pages contribute to memory reclaim protection implicitly.
  This means that low/min limit tuning has to consider hugetlb memory
  as well.

I suspect there is more than the above.

To be completely honest, I am still not convinced this is a good idea.
I do recognize that this might work in very limited environments, but
hugetlb management is quite challenging on its own, and this adds
another layer of complexity which is really hard to see through
without an intimate understanding of both memcg and hugetlb. The
reason hugetlb has been living outside of the core MM (and memcg) is
not just because we like it that way. And yes, I do fully understand
that users shouldn't really care about that, because this is just
memory to them.

We should also consider a global control for this functionality. I am
especially worried about setups where a mixed bag of workloads
(containers) is executed. While some of them will be ready for the new
accounting mode, many will live in their own world without ever being
modified. How do we deal with that situation?

All that being said, I am not going to ack nor nack this, but I really
do prefer to be much more explicit about the motivation and the
current implementation specifics so that we can forward users to
something they can digest.

> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>

[...]

A minor implementation detail below. I couldn't spot anything
obviously broken in the rest of the hugetlb specific code.
restore_reserve_on_memcg_failure is rather clumsy and potentially
error prone, but I will leave that to Mike as he is much more familiar
with that behavior than me.
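As a purely illustrative aside on the SIGBUS point above: an
application backing memory with MAP_HUGETLB has to be prepared for
SIGBUS on first touch when the memcg charge fails, rather than for a
memcg OOM kill. The sketch below is hypothetical userspace code, not
part of the patch; it assumes a 2MB default hugetlb page size and that
the charge is attempted at fault time in alloc_hugetlb_folio(), as the
hunk quoted further down shows.

/* Hypothetical userspace illustration only -- not part of the patch. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (2UL * 1024 * 1024)	/* assumes a 2MB hugetlb page size */

static void sigbus_handler(int sig)
{
	/*
	 * With hugetlb memcg accounting, a failed charge at fault time
	 * shows up here instead of invoking the memcg OOM killer.
	 */
	(void)sig;
	write(STDERR_FILENO, "SIGBUS on hugetlb fault\n", 24);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = { .sa_handler = sigbus_handler };
	char *p;

	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* First touch faults the hugetlb page in (and charges the memcg). */
	memset(p, 0, LEN);
	puts("hugetlb page faulted in and charged");

	munmap(p, LEN);
	return 0;
}

Running something like this in a memcg whose limit cannot accommodate
the hugetlb page would be expected to land in the handler rather than
trigger the memcg OOM killer, even with free pages in the pool.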
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..ff88ea4df11a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
[...]
> @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  						   pages_per_huge_page(h), folio);
>  	}
> +
> +	/* undo allocation if memory controller disallows it. */
> +	if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {

htlb_alloc_mask(h) rather than GFP_KERNEL. Ideally with
__GFP_RETRY_MAYFAIL, which is the default allocation policy.

> +		if (restore_reserve_on_memcg_failure)
> +			restore_reserve_on_error(h, vma, addr, folio);
> +		folio_put(folio);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
>  	return folio;
> 
>  out_uncharge_cgroup:

-- 
Michal Hocko
SUSE Labs
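For illustration only, the charge site from the quoted hunk could be
adjusted along the lines suggested above. This is just a sketch, not
the actual follow-up change, and it assumes the proposed
mem_cgroup_hugetlb_charge_folio() helper keeps the (folio, gfp)
signature shown in the hunk; htlb_alloc_mask() is the existing helper
in mm/hugetlb.c that returns the gfp mask used for the hugetlb page
itself.

	/*
	 * Undo the allocation if the memory controller disallows it.
	 * Charge with the gfp mask the hugetlb folio itself is
	 * allocated with, plus __GFP_RETRY_MAYFAIL, instead of plain
	 * GFP_KERNEL, so the memcg charge follows hugetlb's default
	 * allocation policy.
	 */
	if (mem_cgroup_hugetlb_charge_folio(folio,
			htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL)) {
		if (restore_reserve_on_memcg_failure)
			restore_reserve_on_error(h, vma, addr, folio);
		folio_put(folio);
		return ERR_PTR(-ENOMEM);
	}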