On Mon, Oct 2, 2023 at 5:18 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> For instance, here is one of our usecases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic pages, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab, etc., fair.
>
> What we have had to resort to is constantly polling hugetlb usage and
> readjusting memory.max. A similar procedure is applied to other memory
> limits (e.g. memory.low). However, this is rather cumbersome and
> error-prone. Furthermore, when there is a delay in correcting the
> memory limits (e.g. when hugetlb usage changes between consecutive
> runs of the userspace agent), the system could be left in an over- or
> underprotected state.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is utilized, and uncharging when the folio is freed (analogous to
> the hugetlb controller). Note that we do not charge when the folio is
> allocated to the hugetlb pool, because at this point it is not owned by
> any memcg.
>
> Some caveats to consider:
> * This feature is only available on cgroup v2.
> * There is no hugetlb pool management involved in the memory
>   controller. As stated above, hugetlb folios are only charged towards
>   the memory controller when they are used. Host overcommit management
>   has to take this into account when configuring hard limits.
> * Failure to charge towards the memcg results in SIGBUS. This could
>   happen even if the hugetlb pool still has pages (but the cgroup
>   limit is hit and the reclaim attempt fails).
> * When this feature is enabled, hugetlb pages contribute to memory
>   reclaim protection. Tuning of the low and min limits must take
>   hugetlb memory into account.
> * Hugetlb pages utilized while this option is not selected will not
>   be tracked by the memory controller (even if cgroup v2 is remounted
>   later on).
>
> (Special thanks to Michal Hocko, who pointed most of these out.)
>
> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 29 ++++++++++++++++++++
>  include/linux/cgroup-defs.h             |  5 ++++
>  include/linux/memcontrol.h              |  9 +++++++
>  kernel/cgroup/cgroup.c                  | 15 ++++++++++-
>  mm/hugetlb.c                            | 35 ++++++++++++++++++++-----
>  mm/memcontrol.c                         | 35 +++++++++++++++++++++++++
>  6 files changed, 120 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 622a7f28db1f..606b2e0eac4b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options.
>  	  relying on the original semantics (e.g. specifying bogusly
>  	  high 'bypass' protection values at higher tree levels).
>
> +  memory_hugetlb_accounting
> +	Count HugeTLB memory usage towards the cgroup's overall
> +	memory usage for the memory controller (for the purpose of
> +	statistics reporting and memory protection). This is a new
> +	behavior that could regress existing setups, so it must be
> +	explicitly opted in with this mount option.
> +
> +	A few caveats to keep in mind:
> +
> +	* There is no HugeTLB pool management involved in the memory
> +	  controller. The pre-allocated pool does not belong to anyone.
> +	  Specifically, when a new HugeTLB folio is allocated to
> +	  the pool, it is not accounted for from the perspective of the
> +	  memory controller. It is only charged to a cgroup when it is
> +	  actually used (e.g. at page fault time). Host memory
> +	  overcommit management has to consider this when configuring
> +	  hard limits. In general, HugeTLB pool management should be
> +	  done via other mechanisms (such as the HugeTLB controller).
> +	* Failure to charge a HugeTLB folio to the memory controller
> +	  results in SIGBUS. This could happen even if the HugeTLB pool
> +	  still has pages available (but the cgroup limit is hit and
> +	  the reclaim attempt fails).
> +	* Charging HugeTLB memory towards the memory controller affects
> +	  memory protection and reclaim dynamics. Any userspace tuning
> +	  (e.g. of the low and min limits) needs to take this into
> +	  account.
> +	* HugeTLB pages utilized while this option is not selected
> +	  will not be tracked by the memory controller (even if cgroup
> +	  v2 is remounted later on).
> +
>
>  Organizing Processes and Threads
>  --------------------------------
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f1b3151ac30b..8641f4320c98 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -115,6 +115,11 @@ enum {
>  	 * Enable recursive subtree protection
>  	 */
>  	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
> +
> +	/*
> +	 * Enable hugetlb accounting for the memory controller.
> +	 */
> +	CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
>  };
>
>  /* cftype->flags */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 42bf7e9b1a2f..a827e2129790 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -679,6 +679,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>  	return __mem_cgroup_charge(folio, mm, gfp);
>  }
>
> +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> +				  long nr_pages);
> +
>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>  				   gfp_t gfp, swp_entry_t entry);
>  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> @@ -1262,6 +1265,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
>  	return 0;
>  }
>
> +static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> +						gfp_t gfp, long nr_pages)
> +{
> +	return 0;
> +}
> +
>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>  			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>  {
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1fb7f562289d..f11488b18ceb 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1902,6 +1902,7 @@ enum cgroup2_param {
>  	Opt_favordynmods,
>  	Opt_memory_localevents,
>  	Opt_memory_recursiveprot,
> +	Opt_memory_hugetlb_accounting,
>  	nr__cgroup2_params
>  };
>
> @@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
>  	fsparam_flag("favordynmods",		Opt_favordynmods),
>  	fsparam_flag("memory_localevents",	Opt_memory_localevents),
>  	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
> +	fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
>  	{}
>  };
>
> @@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
>  	case Opt_memory_recursiveprot:
>  		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
>  		return 0;
> +	case Opt_memory_hugetlb_accounting:
> +		ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> +		return 0;
>  	}
>  	return -EINVAL;
>  }
> @@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>  			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
>  		else
>  			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> +
> +		if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> +		else
> +			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
>  	}
>  }
>
> @@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>  		seq_puts(seq, ",memory_localevents");
>  	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
>  		seq_puts(seq, ",memory_recursiveprot");
> +	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +		seq_puts(seq, ",memory_hugetlb_accounting");
>  	return 0;
>  }
>
> @@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
>  			"nsdelegate\n"
>  			"favordynmods\n"
>  			"memory_localevents\n"
> -			"memory_recursiveprot\n");
> +			"memory_recursiveprot\n"
> +			"memory_hugetlb_accounting\n");
>  }
>  static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..74472e911b0a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
>  				     pages_per_huge_page(h), folio);
>  	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  					   pages_per_huge_page(h), folio);
> +	mem_cgroup_uncharge(folio);
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
>
> @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
>  	struct folio *folio;
> -	long map_chg, map_commit;
> +	long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
>  	long gbl_chg;
> -	int ret, idx;
> +	int memcg_charge_ret, ret, idx;
>  	struct hugetlb_cgroup *h_cg = NULL;
> +	struct mem_cgroup *memcg;
>  	bool deferred_reserve;
> +	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +
> +	memcg = get_mem_cgroup_from_current();
> +	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> +	if (memcg_charge_ret == -ENOMEM) {
> +		mem_cgroup_put(memcg);
> +		return ERR_PTR(-ENOMEM);
> +	}
>
>  	idx = hstate_index(h);
>  	/*
> @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	 * code of zero indicates a reservation exists (no change).
>  	 */
>  	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> -	if (map_chg < 0)
> +	if (map_chg < 0) {
> +		if (!memcg_charge_ret)
> +			mem_cgroup_cancel_charge(memcg, nr_pages);
> +		mem_cgroup_put(memcg);
>  		return ERR_PTR(-ENOMEM);
> +	}
>
>  	/*
>  	 * Processes that did not create the mapping will have no
> @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	 */
>  	if (map_chg || avoid_reserve) {
>  		gbl_chg = hugepage_subpool_get_pages(spool, 1);
> -		if (gbl_chg < 0) {
> -			vma_end_reservation(h, vma, addr);
> -			return ERR_PTR(-ENOSPC);
> -		}
> +		if (gbl_chg < 0)
> +			goto out_end_reservation;
>
>  		/*
>  		 * Even though there was no reservation in the region/reserve
> @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  				pages_per_huge_page(h), folio);
>  	}
> +
> +	if (!memcg_charge_ret)
> +		mem_cgroup_commit_charge(folio, memcg);
> +	mem_cgroup_put(memcg);
> +
>  	return folio;
>
>  out_uncharge_cgroup:
> @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  out_subpool_put:
>  	if (map_chg || avoid_reserve)
>  		hugepage_subpool_put_pages(spool, 1);
> +out_end_reservation:
>  	vma_end_reservation(h, vma, addr);
> +	if (!memcg_charge_ret)
> +		mem_cgroup_cancel_charge(memcg, nr_pages);
> +	mem_cgroup_put(memcg);
>  	return ERR_PTR(-ENOSPC);
>  }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0219befeae38..6660684f6f97 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7085,6 +7085,41 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
>  	return ret;
>  }
>
> +/**
> + * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio
> + * @memcg: memcg to charge.
> + * @gfp: reclaim mode.
> + * @nr_pages: number of pages to charge.
> + *
> + * This function is called when allocating a huge page folio, to determine if
> + * the memcg has the capacity for it. It does not commit the charge yet,
> + * as the hugetlb folio itself has not been obtained from the hugetlb pool.
> + *
> + * Once we have obtained the hugetlb folio, we can call
> + * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the
> + * folio, we should instead call mem_cgroup_cancel_charge() to undo the
> + * effect of try_charge().
> + *
> + * Returns 0 on success. Otherwise, an error code is returned.
> + */
> +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> +				  long nr_pages)
> +{
> +	/*
> +	 * If hugetlb memcg charging is not enabled, do not fail the hugetlb
> +	 * allocation, but do not attempt to commit the charge later (or
> +	 * cancel it on error) either.
> +	 */
> +	if (mem_cgroup_disabled() || !memcg ||
> +	    !cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
> +	    !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
> +		return -EOPNOTSUPP;
> +
> +	if (try_charge(memcg, gfp, nr_pages))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
>  /**
>   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
>   * @folio: folio to charge.
> --
> 2.34.1
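
For readers tracing the control flow: the charging protocol here is a
try-then-commit-or-cancel sequence, split in two because the memcg charge
is reserved before the folio is actually taken from the hugetlb pool.
Below is a condensed, illustrative sketch of how a caller is expected to
use the new API. It is not the actual alloc_hugetlb_folio() code: the
reservation/subpool logic is elided, and hugetlb_pool_alloc() is a
hypothetical stand-in for the pool-dequeue steps.

	/* Illustrative sketch only -- assumes the API added by this patch. */
	static struct folio *sketch_alloc(struct hstate *h, gfp_t gfp)
	{
		long nr_pages = pages_per_huge_page(h);
		struct mem_cgroup *memcg = get_mem_cgroup_from_current();
		/* Step 1: reserve the charge before touching the pool. */
		int charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
		struct folio *folio;

		if (charge_ret == -ENOMEM) {
			/* Cgroup limit hit and reclaim failed: caller SIGBUSes. */
			mem_cgroup_put(memcg);
			return ERR_PTR(-ENOMEM);
		}
		/*
		 * -EOPNOTSUPP means accounting is disabled: the allocation
		 * proceeds, but we must neither commit nor cancel later.
		 */

		folio = hugetlb_pool_alloc(h);	/* hypothetical pool dequeue */
		if (!folio) {
			/* Failure path: release the reserved charge. */
			if (!charge_ret)
				mem_cgroup_cancel_charge(memcg, nr_pages);
			mem_cgroup_put(memcg);
			return ERR_PTR(-ENOSPC);
		}

		/* Step 2: bind the reserved charge to the folio. */
		if (!charge_ret)
			mem_cgroup_commit_charge(folio, memcg);
		mem_cgroup_put(memcg);
		/* free_huge_folio() later calls mem_cgroup_uncharge(folio). */
		return folio;
	}

Note how the !charge_ret checks make the commit/cancel calls conditional
on the try_charge having actually happened, which is what keeps the
unaccounted (-EOPNOTSUPP) mode from unbalancing the page counters.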