On Mon, Oct 2, 2023 at 6:43 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Wed 27-09-23 17:57:22, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch rectifies this issue by charging the memcg when the hugetlb
> > folio is allocated, and uncharging when the folio is freed (analogous to
> > the hugetlb controller).
>
> This changelog is missing a lot of information. Both about the usecase
> (we do not want to fish that out from archives in the future) and the
> actual implementation and the reasoning behind that.
>
> AFAICS you have decided to charge on the hugetlb use rather than hugetlb
> allocation to the pool. I suspect the underlying reasoning is that pool
> pages do not belong to anybody. This is a deliberate decision and it
> should be documented as such.

Yep, that was the intention behind placing the charging of the hugetlb
folio in alloc_hugetlb_folio(). I'll document this in the changelog
and/or code.

> It is also very important to describe subtle behavior properties that
> might be rather unintuitive to users. Most notably

If you don't mind, I'll summarize these into the next version of the
patch's changelog :)

> - there is no hugetlb pool management involved in the memcg
>   controller. One has to use the hugetlb controller for that purpose.
>   Also, the preallocated pool as such doesn't belong to anybody, so the
>   memcg host overcommit management has to consider it when configuring
>   hard limits.
> - memcg limit reclaim doesn't assist hugetlb page allocation when
>   hugetlb overcommit is configured (i.e. pages are not consumed from the
>   pool), which means that the page allocation might disrupt workloads
>   from other memcgs.
> - failure to charge a hugetlb page results in SIGBUS rather
>   than the memcg OOM killer. That could be the case even if the
>   hugetlb pool still has pages available and there is
>   reclaimable memory in the memcg.

Ah yes, that should indeed be documented.

> - hugetlb pages are contributing to memory reclaim protection
>   implicitly. This means that the low,min limits tuning has to consider
>   hugetlb memory as well.

This was the original inspiration for this change. I'll expand on it in
the new version's changelog.

> I suspect there is more than the above. To be completely honest, I am
> still not convinced this is a good idea.
>
> I do recognize that this might work in very limited environments, but
> hugetlb management is quite challenging on its own and this just adds
> another layer of complexity which is really hard to see through without
> an intimate understanding of both memcg and hugetlb. The reason that
> hugetlb has been living outside of the core MM (and memcg) is not just
> because we like it that way. And yes, I do fully understand that users
> shouldn't really care about that because this is just memory to them.
>
> We should also consider the global control for this functionality. I am
> especially worried about setups where a mixed bag of workloads
> (containers) is executed. While some of them will be ready for the new
> accounting mode, many will live in their own world without ever being
> modified. How do we deal with that situation?

Johannes already responded to this, but I also think this hypothetical
situation isn't super urgent to handle right now. That said, we can
always revisit it if/when it proves to be an issue and add an
appropriate memcg-specific control for this feature as a follow-up.
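
One more note on the SIGBUS point above, mostly so we document it well:
from userspace, a failed memcg charge looks exactly like the existing
"no hugetlb page available at fault time" case, so applications that
already handle that keep working. A minimal illustration of that
handling (my own sketch, not part of this series; it assumes a 2MB
default hugepage size and uses MAP_NORESERVE so the failure shows up at
fault time rather than at mmap()):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (2UL * 1024 * 1024)	/* one default-sized (2MB) hugetlb page */

static void sigbus_handler(int sig)
{
	/* async-signal-safe reporting only */
	static const char msg[] =
		"SIGBUS: hugetlb page could not be faulted in\n";

	(void)sig;
	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(1);
}

int main(void)
{
	struct sigaction sa;
	char *p;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigbus_handler;
	sigaction(SIGBUS, &sa, NULL);

	/*
	 * MAP_NORESERVE skips the reservation at mmap() time, so a missing
	 * page (or, with this patch, a failed memcg charge) surfaces as
	 * SIGBUS on the first touch below.
	 */
	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_NORESERVE,
		 -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0, LEN);	/* fault the page in */
	puts("hugetlb page faulted in successfully");

	munmap(p, LEN);
	return 0;
}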
>
> All that being said, I am not going to ack nor nack this, but I really do
> prefer to be much more explicit about the motivation and current
> implementation specifics so that we can forward users to something
> they can digest.
>
> > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> [...]
>
> A minor implementation detail below. I couldn't spot anything obviously
> broken with the rest of the hugetlb-specific code. restore_reserve_on_memcg_failure
> is rather clumsy and potentially error prone, but I will leave that to
> Mike as he is much more familiar with that behavior than me.

That part irks me too, but I'm trying to follow the error-handling
logic that follows each alloc_hugetlb_folio() call site. If anyone has
any suggestions, I'd be happy to listen!

> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index de220e3ff8be..ff88ea4df11a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> [...]
> > @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> >  		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> >  				pages_per_huge_page(h), folio);
> >  	}
> > +
> > +	/* undo allocation if memory controller disallows it. */
> > +	if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {
>
> htlb_alloc_mask(h) rather than GFP_KERNEL. Ideally with
> __GFP_RETRY_MAYFAIL which is the default allocation policy.

Oh, I wasn't aware of htlb_alloc_mask(h). I'll fix this to:

	htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL

> > +		if (restore_reserve_on_memcg_failure)
> > +			restore_reserve_on_error(h, vma, addr, folio);
> > +		folio_put(folio);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +
> >  	return folio;
> >
> >  out_uncharge_cgroup:

> --
> Michal Hocko
> SUSE Labs
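
To make sure we're on the same page, here is roughly what the charge
site in alloc_hugetlb_folio() would look like with that gfp change
folded in (just a sketch against this series; mem_cgroup_hugetlb_charge_folio()
and restore_reserve_on_memcg_failure are introduced by this patch and
aren't upstream):

	/* undo allocation if memory controller disallows it. */
	if (mem_cgroup_hugetlb_charge_folio(folio,
				htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL)) {
		if (restore_reserve_on_memcg_failure)
			restore_reserve_on_error(h, vma, addr, folio);
		folio_put(folio);
		return ERR_PTR(-ENOMEM);
	}

	return folio;

I'll also add a comment explaining why htlb_alloc_mask(h) is used here,
so future readers know the charge follows the hugetlb allocation policy
rather than GFP_KERNEL.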