On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> > On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > > From: Zi Yan <ziy@xxxxxxxxxx>
> > > >
> > > > Hi all,
> > > >
> > > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > > v5.9-rc2-mmots-2020-08-25-21-13.
> > > >
> > > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > > performance of applications with large memory footprint without application
> > > > changes compared to hugetlb.
> > >
> > > Please be more specific about usecases. This better have some strong
> > > ones because THP code is complex enough already to add on top solely
> > > based on a generic TLB pressure easing.
> >
> > Hello, Michal!
> >
> > We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> > performance wins on some workloads.
>
> Let me clarify. I am not questioning 1GB (or large) pages in general. I
> believe it is quite clear that there are usecases which hugely benefit
> from them. I am mostly asking for the transparent part of it which
> traditionally means that userspace mostly doesn't have to care and get
> them. 2MB THPs have established certain expectations mostly a really
> aggressive pro-active instanciation. This has bitten us many times and
> create a "you need to disable THP to fix your problem whatever that is"
> cargo cult. I hope we do not want to repeat that mistake here again.

Absolutely, I agree with all of the above. 1 GB THPs have even fewer chances
of being allocated automatically without hurting overall performance.

I believe that historically the THP allocation success rate and cost were
not good enough to offer a strict interface, which is why the "best effort"
approach was used. Maybe I'm wrong here. Also, in some cases (e.g. desktop)
an opportunistic approach looks like "some perf boost for free".
However, in the case of large distributed systems it's important to get
predictable and uniform performance across nodes, so "maybe some hosts will
perform better" doesn't buy us much.

> > Historically we allocated gigantic pages at the boot time, but recently moved
> > to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> > than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> > see it as a very useful feature.
> >
> > Given the cost of an allocation, I'm slightly skeptical about an automatic
> > heuristics-based approach, but if an application can explicitly mark target areas
> > with madvise(), I don't see why it wouldn't work.
>
> An explicit opt-in sounds much more appropriate to me as well. If we go
> with a specific API then I would not make it 1GB pages specific. Why
> cannot we have an explicit interface to "defragment" address space
> range into large pages and the kernel would use large pages where
> appropriate? Or is the additional copying prohibitively expensive?

Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
provides something similar to what you're describing, but there are a lot of
details here, so I'm probably missing something.

Thank you!