Re: [PATCH 4/4] hugetlb: add support for gigantic page allocation at runtime

Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> · Mon, 07 Apr 2014 15:03:18 -0400

On Mon, Apr 07, 2014 at 02:49:35PM -0400, Luiz Capitulino wrote:
> On Mon, 07 Apr 2014 13:58:29 -0400
> Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> wrote:
> 
> > On Wed, Apr 02, 2014 at 02:08:48PM -0400, Luiz Capitulino wrote:
> > > HugeTLB is limited to allocating hugepages whose size is less than
> > > MAX_ORDER order. This is so because HugeTLB allocates hugepages via
> > > the buddy allocator. Gigantic pages (that is, pages whose size is
> > > greater than MAX_ORDER order) have to be allocated at boottime.
> > > 
> > > However, boottime allocation has at least two serious problems. First,
> > > it doesn't support NUMA and second, gigantic pages allocated at
> > > boottime can't be freed.
> > > 
> > > This commit solves both issues by adding support for allocating gigantic
> > > pages during runtime. It works just like regular sized hugepages,
> > > meaning that the interface in sysfs is the same, it supports NUMA,
> > > and gigantic pages can be freed.
> > > 
> > > For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
> > > gigantic pages on node 1, one can do:
> > > 
> > >  # echo 2 > \
> > >    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> > > 
> > > And to free them later:
> > > 
> > >  # echo 0 > \
> > >    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> > > 
> > > The one problem with gigantic page allocation at runtime is that it
> > > can't be serviced by the buddy allocator. To overcome that problem, this
> > > series scans all zones from a node looking for a large enough contiguous
> > > region. When one is found, it's allocated by using CMA, that is, we call
> > > alloc_contig_range() to do the actual allocation. For example, on x86_64
> > > we scan all zones looking for a 1GB contiguous region. When one is found
> > > it's allocated by alloc_contig_range().
> > > 
> > > One expected issue with that approach is that such gigantic contiguous
> > > regions tend to vanish as time goes by. The best way to avoid this for
> > > now is to make gigantic page allocations very early during boot, say
> > > from an init script. Other possible optimizations include using compaction,
> > > which is supported by CMA but is not explicitly used by this commit.
> > > 
> > > It's also important to note the following:
> > > 
> > >  1. My target systems are x86_64 machines, so I have only tested 1GB
> > >     page allocation/release. I did try to make this arch independent
> > >     and expect it to work on other archs, but didn't try it myself.
> > > 
> > >  2. I didn't add support for hugepage overcommit, that is, allocating
> > >     a gigantic page on demand when
> > >     /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
> > >     think it's reasonable to do the hard and long work required for
> > >     allocating a gigantic page at fault time. But it should be simple
> > >     to add this if wanted.
> > > 
> > > Signed-off-by: Luiz Capitulino <lcapitulino@xxxxxxxxxx>
> > 
> > I agree with the basic idea. One question below ...
> 
> Good to hear that.
> 
> > > ---
> > >  arch/x86/include/asm/hugetlb.h |  10 +++
> > >  mm/hugetlb.c                   | 177 ++++++++++++++++++++++++++++++++++++++---
> > >  2 files changed, 176 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> > > index a809121..2b262f7 100644
> > > --- a/arch/x86/include/asm/hugetlb.h
> > > +++ b/arch/x86/include/asm/hugetlb.h
> > > @@ -91,6 +91,16 @@ static inline void arch_release_hugepage(struct page *page)
> > >  {
> > >  }
> > >  
> > > +static inline int arch_prepare_gigantic_page(struct page *page)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static inline void arch_release_gigantic_page(struct page *page)
> > > +{
> > > +}
> > > +
> > > +
> > >  static inline void arch_clear_hugepage_flags(struct page *page)
> > >  {
> > >  }
> > 
> > These are defined only in arch/x86, but called from generic code.
> > Does this cause build failures on other archs?
> 
> Hmm, probably. The problem here is that I'm unable to test this
> code on other archs. So I think the best solution for the first
> merge is to make the build of this feature conditional on x86_64?

Yes, I think that's safer.

Naoya

> Then the first person interested in making this work on other
> archs adds the generic code. Sounds reasonable?
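
For anyone following along, the zone scan from the patch description
boils down to roughly the following. This is only a userspace toy, not
the code from the patch: the bool array stands in for real page state,
and every name here (find_gigantic_range(), range_is_free(), NO_RANGE)
is made up for illustration. It assumes a candidate region must be
aligned to its own size.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "nothing found" marker, not a kernel constant. */
#define NO_RANGE ((size_t)-1)

/* Round x up to the next multiple of align (align must be non-zero). */
static size_t round_up_to(size_t x, size_t align)
{
	return (x + align - 1) / align * align;
}

/* True if none of the nr_pages frames starting at start is in use. */
static bool range_is_free(const bool *in_use, size_t start, size_t nr_pages)
{
	for (size_t i = start; i < start + nr_pages; i++)
		if (in_use[i])
			return false;
	return true;
}

/*
 * Scan one toy "zone" covering frames [zone_start, zone_end) for an
 * nr_pages-aligned run of nr_pages free frames.  Returns the first
 * frame of the run, or NO_RANGE if the zone has no such run.
 */
static size_t find_gigantic_range(const bool *in_use, size_t zone_start,
				  size_t zone_end, size_t nr_pages)
{
	if (nr_pages == 0)
		return NO_RANGE;

	size_t pfn = round_up_to(zone_start, nr_pages);

	while (pfn + nr_pages <= zone_end) {
		if (range_is_free(in_use, pfn, nr_pages))
			return pfn;
		pfn += nr_pages;	/* only aligned candidates are tried */
	}
	return NO_RANGE;
}
```

In the real series the range found this way is then handed to
alloc_contig_range(), and the scan is repeated over every zone of the
node.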
