On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote:
> Hi,
>
> On Tue 01-03-16 02:09:11, Matthew Wilcox wrote:
> > There are a few issues around 1GB THP support that I've come up against
> > while working on DAX support that I think may be interesting to discuss
> > in person.
> >
> > - Do we want to add support for 1GB THP for anonymous pages? DAX support
> > is driving the initial 1GB THP support, but would anonymous VMAs also
> > benefit from 1GB support? I'm not volunteering to do this work, but
> > it might make an interesting conversation if we can identify some users
> > who think performance would be better if they had 1GB THP support.
>
> Some time ago I was thinking about 1GB THP and I was wondering: What is the
> motivation for 1GB pages for persistent memory? Is it the savings in memory
> used for page tables? Or is it about the cost of fault?
>

If anything, the cost of the fault is going to suck, as a 1G allocation and
zeroing are required even if the application only needs 4K. It's by no means
a universal win. The savings are in page table usage, TLB miss cost and TLB
footprint.

For anonymous memory, it's not considered to be worth it because the cost of
allocating the page is so high even if it works, and there is no guarantee
it'll work as fragmentation avoidance only operates on the 2M boundary.

It's worse when files are involved because there is a write-multiplication
effect when huge pages are used. Specifically, a fault incurs 1G of IO even
if only 4K is required (a worst-case amplification factor of 1G/4K, or
262144), and dirty information is then only tracked at huge page
granularity. This increased IO can offset any TLB-related benefit. I'm
highly skeptical that THP for persistent memory is even worthwhile once the
write-multiplication factors and allocation costs are taken into
consideration. I was surprised it was even attempted before the basic
features of persistent memory were completed. I felt it should have been
avoided until the 4K case was as fast as possible and we were hitting
problems where the TLB was the limiting factor.

Given that I recently threw in the towel over the cost of 2M allocations,
let alone 1G allocations, I'm highly skeptical that 1G anonymous pages are
worth the cost.

> If it is mainly about the fault cost, won't some fault-around logic (i.e.
> filling more PMD entries in one PMD fault) go a long way towards reducing
> fault cost without some complications?
>

I think this would be a prerequisite. Basically, the idea is that a 2M page
is reserved, but not allocated, in response to a 4K page fault. The 4K page
is then inserted at its properly aligned offset within that reservation. If
there are faults around it then they use the other properly aligned pages,
and when the whole 2M chunk has been allocated it is promoted to a huge page
at that point. Early research considered whether a fill factor other than 1
should trigger hugepage promotion, but that would have to be re-evaluated on
modern hardware. I'm not aware of anyone actually working on such an
implementation though because it'd be a lot of legwork. I wrote a TODO item
about this at some point in the distant past that never got to the top of
the list:

Title: In-place huge page collapsing

Description: When collapsing a huge page, the kernel allocates a huge page
and then copies from the base pages. This is expensive. Investigate in-place
reservation whereby a base page is faulted in but the properly placed pages
around it are reserved for that process unless the alternative is to fail
the allocation. Care would be needed to ensure that the kernel does not
start reclaiming because pages are reserved, or increase contention on
zone->lock. If it works correctly we would be able to collapse huge pages
without copying, and it would also perform extremely well when the workload
uses sparse address spaces.
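To make the in-place reservation idea a bit more concrete, here is a minimal
userspace sketch of the bookkeeping it implies. This is not kernel code and
every name in it (reserve_block(), fault_base_page(), PROMOTE_FILL and so
on) is invented for illustration: the first 4K fault reserves the
surrounding 2M-aligned block, later faults fill their naturally aligned
slots, and once the fill factor is reached the block is "promoted" without
any copy.

/*
 * Toy model of the in-place collapse idea described above, not kernel
 * code: a 4K fault reserves the whole aligned 2M block it falls in, hands
 * out base pages at their natural offsets within that block, and promotes
 * the block to a huge mapping once the fill factor is reached.  All names
 * and the fill-factor threshold are made up for illustration.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define BASE_SIZE      4096UL
#define HUGE_SIZE      (2UL * 1024 * 1024)
#define PAGES_PER_HUGE (HUGE_SIZE / BASE_SIZE)          /* 512 */
#define PROMOTE_FILL   PAGES_PER_HUGE                   /* fill factor of 1 */

struct huge_reservation {
        unsigned long base;                  /* 2M-aligned start address */
        unsigned long filled;                /* base pages faulted in so far */
        bool mapped[PAGES_PER_HUGE];         /* which 4K slots are populated */
        bool promoted;                       /* now mapped by a single PMD */
};

/* Reserve (but do not populate) the aligned 2M block containing addr. */
static struct huge_reservation *reserve_block(unsigned long addr)
{
        struct huge_reservation *res = calloc(1, sizeof(*res));

        if (!res)
                return NULL;
        res->base = addr & ~(HUGE_SIZE - 1);
        return res;
}

/*
 * Handle a 4K fault: populate only the faulting slot, track the fill
 * level and collapse in place (no copy) once the block is full enough.
 */
static void fault_base_page(struct huge_reservation *res, unsigned long addr)
{
        unsigned long slot = (addr - res->base) / BASE_SIZE;

        if (res->promoted || res->mapped[slot])
                return;

        res->mapped[slot] = true;            /* insert the properly aligned page */
        res->filled++;

        if (res->filled >= PROMOTE_FILL) {
                res->promoted = true;        /* replace 512 PTEs with one PMD */
                printf("promoted block at %#lx without copying\n", res->base);
        }
}

int main(void)
{
        unsigned long start = 0x40000000UL;  /* arbitrary 2M-aligned address */
        struct huge_reservation *res = reserve_block(start);

        if (!res)
                return 1;

        /* Touch every 4K page in the block; the last fault triggers promotion. */
        for (unsigned long off = 0; off < HUGE_SIZE; off += BASE_SIZE)
                fault_base_page(res, start + off);

        free(res);
        return 0;
}

A real implementation would also have to hand reservations back under memory
pressure, which is where the reclaim and zone->lock concerns above come in.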
> > - Cache pressure from 1GB page support. If we're using NT stores, they
> > bypass the cache, and all should be good. But if there are
> > architectures that support THP and not NT stores, zeroing a page is
> > just going to obliterate their caches.
>
> Even doing fsync() - and thus flush all cache lines associated with 1GB
> page - is likely going to take noticeable chunk of time. The granularity of
> cache flushing in kernel is another thing that makes me somewhat cautious
> about 1GB pages.
>

Problems like this were highlighted in early hugepage-related papers in the
90s. Even if persistent memory is extremely fast, there are going to be
large costs. In-place promotion would avoid some of the worst of them.

If it were me, I would focus on getting all the basic features of persistent
memory working first, finding out whether there are workloads that are
limited by TLB pressure, and then, and only then, start worrying about 1G
pages. If that is not done then persistent memory could fall into the same
trap the VM did, whereby huge pages were used to work around bottlenecks
within the VM or crappy hardware.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html