Re: [External] Re: [PATCH v4] mm/hugetlb: add mempolicy check in the reservation routine

Muchun Song <songmuchun@xxxxxxxxxxxxx> · Tue, 28 Jul 2020 22:16:29 +0800

On Tue, Jul 28, 2020 at 9:25 PM Baoquan He <bhe@xxxxxxxxxx> wrote:
>
> Hi Muchun,
>
> On 07/28/20 at 11:49am, Muchun Song wrote:
> > In the reservation routine, we only check whether the cpuset meets
> > the memory allocation requirements, but we ignore the MPOL_BIND
> > mempolicy case. As a result, an mmap of hugetlb memory can succeed
> > while the subsequent memory allocation fails due to mempolicy
> > restrictions, and the task receives a SIGBUS signal. This can be
> > reproduced by the following steps.
> >
> >  1) Compile the test case.
> >     cd tools/testing/selftests/vm/
> >     gcc map_hugetlb.c -o map_hugetlb
> >
> >  2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
> >     system. Each node will pre-allocate one huge page.
> >     echo 2 > /proc/sys/vm/nr_hugepages
> >
> >  3) Run the test case (mmap 4MB). We receive the SIGBUS signal.
> >     numactl --membind=0 ./map_hugetlb 4
>
> I think supporting the MPOL_BIND mempolicy case is a good idea.
> I am wondering about the other mempolicy cases, e.g. MPOL_INTERLEAVE and
> MPOL_PREFERRED. I ask because we already have similar handling for
> writes to nr_hugepages_mempolicy in sysfs and proc. Please see
> __nr_hugepages_store_common() for details.

Yeah, I know about nr_hugepages_mempolicy. But this new code will
help produce a quick failure, as described in the commit message,
instead of waiting until the page fault routine (and receiving a SIGBUS
signal).
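
To make the check concrete, here is a rough userspace sketch of the
counting that allowed_mems_nr() does in the patch. It is not kernel code:
node_mask_t, mask_has_node(), sketch_allowed_mems_nr() and
sketch_reserve() are hypothetical names used only for illustration, a
plain 64-bit mask stands in for nodemask_t, and a NULL mpol_allowed
models policy_nodemask_current() returning NULL (no bind mask to apply).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 64	/* hypothetical limit, one bit per node */

/* Bit n set means node n is allowed. Stand-in for nodemask_t. */
typedef unsigned long long node_mask_t;

static bool mask_has_node(node_mask_t mask, unsigned int node)
{
	return (mask >> node) & 1ULL;
}

/*
 * Sum the free huge pages on nodes that the cpuset allows and, when a
 * mempolicy mask is present, that the mempolicy allows as well.
 */
static unsigned int sketch_allowed_mems_nr(const unsigned int *free_per_node,
					   unsigned int nr_nodes,
					   node_mask_t cpuset_allowed,
					   const node_mask_t *mpol_allowed)
{
	unsigned int node, nr = 0;

	assert(nr_nodes <= SKETCH_MAX_NODES);

	for (node = 0; node < nr_nodes; node++) {
		if (!mask_has_node(cpuset_allowed, node))
			continue;
		if (!mpol_allowed || mask_has_node(*mpol_allowed, node))
			nr += free_per_node[node];
	}

	return nr;
}

/*
 * Return 0 if a reservation of delta pages may proceed, or -1 if it
 * should fail up front (the mmap-time failure described above).
 */
static int sketch_reserve(const unsigned int *free_per_node,
			  unsigned int nr_nodes,
			  node_mask_t cpuset_allowed,
			  const node_mask_t *mpol_allowed,
			  long delta)
{
	long allowed = (long)sketch_allowed_mems_nr(free_per_node, nr_nodes,
						    cpuset_allowed,
						    mpol_allowed);

	return delta > allowed ? -1 : 0;
}
```

For the reproducer in the commit message (two nodes with one free huge
page each, bound to node 0, and assuming 2MB huge pages so that 4MB
means delta = 2), the sketch counts only one usable page, so the
reservation is rejected at mmap time rather than at fault time.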

>
> Thanks
> Baoquan
>
> >
> > With this patch applied, the mmap in step 3) fails with
> > "mmap: Cannot allocate memory".
> >
> > Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> > Reported-by: Jianchao Guo <guojianchao@xxxxxxxxxxxxx>
> > Suggested-by: Michal Hocko <mhocko@xxxxxxxxxx>
> > Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > ---
> > changelog in v4:
> >  1) Fix compilation errors with !CONFIG_NUMA.
> >
> > changelog in v3:
> >  1) Do not allocate nodemask on the stack.
> >  2) Update comment.
> >
> > changelog in v2:
> >  1) Reuse policy_nodemask().
> >
> >  include/linux/mempolicy.h | 14 ++++++++++++++
> >  mm/hugetlb.c              | 22 ++++++++++++++++++----
> >  mm/mempolicy.c            |  2 +-
> >  3 files changed, 33 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> > index ea9c15b60a96..0656ece1ccf1 100644
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -152,6 +152,15 @@ extern int huge_node(struct vm_area_struct *vma,
> >  extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> >  extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> >                               const nodemask_t *mask);
> > +extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
> > +
> > +static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
> > +{
> > +     struct mempolicy *mpol = get_task_policy(current);
> > +
> > +     return policy_nodemask(gfp, mpol);
> > +}
> > +
> >  extern unsigned int mempolicy_slab_node(void);
> >
> >  extern enum zone_type policy_zone;
> > @@ -281,5 +290,10 @@ static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
> >  static inline void mpol_put_task_policy(struct task_struct *task)
> >  {
> >  }
> > +
> > +static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
> > +{
> > +     return NULL;
> > +}
> >  #endif /* CONFIG_NUMA */
> >  #endif
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 589c330df4db..a34458f6a475 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3463,13 +3463,21 @@ static int __init default_hugepagesz_setup(char *s)
> >  }
> >  __setup("default_hugepagesz=", default_hugepagesz_setup);
> >
> > -static unsigned int cpuset_mems_nr(unsigned int *array)
> > +static unsigned int allowed_mems_nr(struct hstate *h)
> >  {
> >       int node;
> >       unsigned int nr = 0;
> > +     nodemask_t *mpol_allowed;
> > +     unsigned int *array = h->free_huge_pages_node;
> > +     gfp_t gfp_mask = htlb_alloc_mask(h);
> > +
> > +     mpol_allowed = policy_nodemask_current(gfp_mask);
> >
> > -     for_each_node_mask(node, cpuset_current_mems_allowed)
> > -             nr += array[node];
> > +     for_each_node_mask(node, cpuset_current_mems_allowed) {
> > +             if (!mpol_allowed ||
> > +                 node_isset(node, *mpol_allowed))
> > +                     nr += array[node];
> > +     }
> >
> >       return nr;
> >  }
> > @@ -3648,12 +3656,18 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
> >        * we fall back to check against current free page availability as
> >        * a best attempt and hopefully to minimize the impact of changing
> >        * semantics that cpuset has.
> > +      *
> > +      * Apart from cpuset, the memory policy mechanism also
> > +      * determines from which nodes the kernel allocates memory
> > +      * in a NUMA system. So, as with cpuset above, we should also
> > +      * take the memory policy of the current task into account
> > +      * when checking free page availability.
> >        */
> >       if (delta > 0) {
> >               if (gather_surplus_pages(h, delta) < 0)
> >                       goto out;
> >
> > -             if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
> > +             if (delta > allowed_mems_nr(h)) {
> >                       return_unused_surplus_pages(h, delta);
> >                       goto out;
> >               }
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 93fcfc1f2fa2..fce14c3f4f38 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1873,7 +1873,7 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> >   * Return a nodemask representing a mempolicy for filtering nodes for
> >   * page allocation
> >   */
> > -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> >  {
> >       /* Lower zones don't get a nodemask applied for MPOL_BIND */
> >       if (unlikely(policy->mode == MPOL_BIND) &&
> > --
> > 2.11.0
> >
> >
>

--
Yours,
Muchun