On Mon, Oct 2, 2023 at 6:43 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Wed 27-09-23 17:57:22, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch rectifies this issue by charging the memcg when the hugetlb
> > folio is allocated, and uncharging when the folio is freed (analogous to
> > the hugetlb controller).
>
> This changelog is missing a lot of information. Both about the usecase
> (we do not want to fish that out from archives in the future) and the
> actual implementation and the reasoning behind that.
>
> AFAICS you have decided to charge on the hugetlb use rather than hugetlb
> allocation to the pool. I suspect the underlying reasoning is that pool
> pages do not belong to anybody. This is a deliberate decision and it
> should be documented as such.

Yep, that was the intention behind placing the charging of the hugetlb
folio in alloc_hugetlb_folio(). I'll document this in the changelog
and/or code.

> It is also very important to describe subtle behavior properties that
> might be rather unintuitive to users. Most notably

If you don't mind, I'll summarize these into the next version of the
patch's changelog :)

> - there is no hugetlb pool management involved in the memcg
>   controller. One has to use the hugetlb controller for that purpose.
>   Also, the preallocated pool as such doesn't belong to anybody, so the
>   memcg host overcommit management has to consider it when configuring
>   hard limits.
> - memcg limit reclaim doesn't assist hugetlb page allocation when
>   hugetlb overcommit is configured (i.e. pages are not consumed from the
>   pool), which means that the page allocation might disrupt workloads
>   from other memcgs.
> - failure to charge a hugetlb page results in SIGBUS rather
>   than the memcg OOM killer. That could be the case even if the
>   hugetlb pool still has pages available and there is
>   reclaimable memory in the memcg.

Ah yes, that should indeed be documented.

> - hugetlb pages are contributing to memory reclaim protection
>   implicitly. This means that the low,min limits tuning has to consider
>   hugetlb memory as well.

This was the original inspiration for this change. I'll expand on it in
the new version's changelog.

> I suspect there is more than the above. To be completely honest, I am
> still not convinced this is a good idea.
>
> I do recognize that this might work in very limited environments, but
> hugetlb management is quite challenging on its own and this just adds
> another layer of complexity which is really hard to see through without
> an intimate understanding of both memcg and hugetlb. The reason that
> hugetlb has been living outside of the core MM (and memcg) is not just
> because we like it that way. And yes, I do fully understand that users
> shouldn't really care about that because this is just memory to them.
>
> We should also consider the global control for this functionality. I am
> especially worried about setups where a mixed bag of workloads
> (containers) is executed. While some of them will be ready for the new
> accounting mode, many will live in their own world without ever being
> modified. How do we deal with that situation?

Johannes already responded to this, but I also think this hypothetical
situation isn't super urgent to handle right now. That said, we can
always revisit it if/when it proves to be an issue and add an
appropriate memcg-specific control for this feature as a follow-up.
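
One more note on the SIGBUS point above, mostly so we document it well:
from userspace, a failed memcg charge looks exactly like the existing
"no hugetlb page available at fault time" case, so applications that
already handle that keep working. A minimal illustration of that
handling (my own sketch, not part of this series; it assumes a 2MB
default hugepage size and uses MAP_NORESERVE so the failure shows up at
fault time rather than at mmap()):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (2UL * 1024 * 1024)	/* one default-sized (2MB) hugetlb page */

static void sigbus_handler(int sig)
{
	/* async-signal-safe reporting only */
	static const char msg[] =
		"SIGBUS: hugetlb page could not be faulted in\n";

	(void)sig;
	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(1);
}

int main(void)
{
	struct sigaction sa;
	char *p;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigbus_handler;
	sigaction(SIGBUS, &sa, NULL);

	/*
	 * MAP_NORESERVE skips the reservation at mmap() time, so a missing
	 * page (or, with this patch, a failed memcg charge) surfaces as
	 * SIGBUS on the first touch below.
	 */
	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_NORESERVE,
		 -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0, LEN);	/* fault the page in */
	puts("hugetlb page faulted in successfully");

	munmap(p, LEN);
	return 0;
}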
>
> All that being said, I am not going to ack nor nack this, but I really do
> prefer to be much more explicit about the motivation and current
> implementation specifics so that we can forward users to something
> they can digest.
>
> > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> [...]
>
> A minor implementation detail below. I couldn't spot anything obviously
> broken with the rest of the hugetlb-specific code. restore_reserve_on_memcg_failure
> is rather clumsy and potentially error prone, but I will leave that to
> Mike as he is much more familiar with that behavior than me.

That part irks me too, but I'm trying to follow the error-handling
logic that follows each alloc_hugetlb_folio() call site. If anyone has
any suggestions, I'd be happy to listen!

> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index de220e3ff8be..ff88ea4df11a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> [...]
> > @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> >  		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> >  				pages_per_huge_page(h), folio);
> >  	}
> > +
> > +	/* undo allocation if memory controller disallows it. */
> > +	if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {
>
> htlb_alloc_mask(h) rather than GFP_KERNEL. Ideally with
> __GFP_RETRY_MAYFAIL which is the default allocation policy.

Oh, I wasn't aware of htlb_alloc_mask(h). I'll fix this to:

	htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL

> > +		if (restore_reserve_on_memcg_failure)
> > +			restore_reserve_on_error(h, vma, addr, folio);
> > +		folio_put(folio);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +
> >  	return folio;
> >
> >  out_uncharge_cgroup:

> --
> Michal Hocko
> SUSE Labs
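
To make sure we're on the same page, here is roughly what the charge
site in alloc_hugetlb_folio() would look like with that gfp change
folded in (just a sketch against this series; mem_cgroup_hugetlb_charge_folio()
and restore_reserve_on_memcg_failure are introduced by this patch and
aren't upstream):

	/* undo allocation if memory controller disallows it. */
	if (mem_cgroup_hugetlb_charge_folio(folio,
				htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL)) {
		if (restore_reserve_on_memcg_failure)
			restore_reserve_on_error(h, vma, addr, folio);
		folio_put(folio);
		return ERR_PTR(-ENOMEM);
	}

	return folio;

I'll also add a comment explaining why htlb_alloc_mask(h) is used here,
so future readers know the charge follows the hugetlb allocation policy
rather than GFP_KERNEL.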