Hi,

On Tue 01-03-16 02:09:11, Matthew Wilcox wrote:
> There are a few issues around 1GB THP support that I've come up against
> while working on DAX support that I think may be interesting to discuss
> in person.
>
> - Do we want to add support for 1GB THP for anonymous pages? DAX support
> is driving the initial 1GB THP support, but would anonymous VMAs also
> benefit from 1GB support? I'm not volunteering to do this work, but
> it might make an interesting conversation if we can identify some users
> who think performance would be better if they had 1GB THP support.

Some time ago I was thinking about 1GB THP and I was wondering: what is
the motivation for 1GB pages for persistent memory? Is it the savings in
memory used for page tables? Or is it about the cost of a fault? If it is
mainly about the fault cost, won't some fault-around logic (i.e. filling
in more PMD entries in one PMD fault) go a long way towards reducing the
fault cost without some of the complications?

> - Latency of a major page fault. According to various public reviews,
> main memory bandwidth is about 30GB/s on a Core i7-5960X with 4
> DDR4 channels. I think people are probably fairly unhappy about
> doing only 30 page faults per second. So maybe we need a more complex
> scheme to handle major faults where we insert a temporary 2MB mapping,
> prepare the other 2MB pages in the background, then merge them into
> a 1GB mapping when they're completed.

Yeah, here is one of the complications I mentioned above ;)

> - Cache pressure from 1GB page support. If we're using NT stores, they
> bypass the cache, and all should be good. But if there are
> architectures that support THP and not NT stores, zeroing a page is
> just going to obliterate their caches.

Even doing fsync() - and thus flushing all cache lines associated with a
1GB page - is likely going to take a noticeable chunk of time. The
granularity of cache flushing in the kernel is another thing that makes me
somewhat cautious about 1GB pages.

> Other topics that might interest people from a VM/FS point of view:
>
> - Uses for (or replacement of) the radix tree. We're currently
> looking at using the radix tree with DAX in order to reduce the number
> of calls into the filesystem. That's leading to various enhancements
> to the radix tree, such as support for a lock bit for exceptional
> entries (Neil Brown), and support for multi-order entries (me).
> Is the (enhanced) radix tree the right data structure to be using
> for this brave new world of huge pages in the page cache, or should
> we be looking at some other data structure like an RB-tree?

I was also thinking about whether we wouldn't be better off with some
other data structure than a radix tree for DAX, and I didn't really find
anything I'd be satisfied with. The main advantages of the radix tree as I
see them are: it has constant depth, it supports lockless lookups, it is
relatively simple (although with the additions we'd need, this advantage
slowly vanishes), and it is pretty space efficient for the common cases.

For your multi-order entries I was wondering whether we shouldn't relax
the requirement that all nodes have the same number of slots - e.g. the
number of slots could vary with node depth so that PMD and eventually PUD
multi-order entries end up occupying a single slot at the appropriate
radix tree level.
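To make that a bit more concrete, here is a purely hypothetical sketch of
what I mean by per-level fan-out (all names below are made up, nothing
like this exists in the tree today), assuming 4K base pages:

/*
 * Per-level fan-out, in bits of the page index.  With 4K base pages,
 * giving the two lowest levels 9 bits each means one slot in a level-1
 * node spans 2^9 pages = 2MB (a PMD entry) and one slot in a level-2
 * node spans 2^18 pages = 1GB (a PUD entry), so a multi-order entry is
 * just one slot at the right level.  Higher levels keep the usual
 * 6-bit fan-out of today's radix tree.
 */
static const unsigned int level_shift[] = { 9, 9, 6, 6, 6, 6, 6 };

struct vrt_node {
	unsigned char	level;		/* 0 == closest to the leaves */
	unsigned int	count;		/* number of populated slots */
	void		*slots[];	/* 1 << level_shift[level] pointers */
};

/* Which slot of a node at 'level' covers page 'index'? */
static unsigned int vrt_slot(unsigned long index, unsigned int level)
{
	unsigned int shift = 0, i;

	for (i = 0; i < level; i++)
		shift += level_shift[i];
	return (index >> shift) & ((1U << level_shift[level]) - 1);
}

Lookups would then need a per-level shift table instead of a single
compile-time constant, so this is part of the simplicity we'd be trading
away.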
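Going back to the cache pressure point: just to illustrate what "NT stores
bypass the cache" buys us, a minimal userspace sketch (not the kernel's
zeroing path), assuming x86-64 with SSE2 - _mm_stream_si128() and friends
are the standard compiler intrinsics:

#include <emmintrin.h>	/* SSE2 intrinsics: _mm_stream_si128() etc. */
#include <stddef.h>

/* Zero 'len' bytes with non-temporal stores; 'buf' must be 16-byte aligned. */
static void zero_nocache(void *buf, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	char *p = buf;
	size_t i;

	for (i = 0; i < len; i += 16)
		_mm_stream_si128((__m128i *)(p + i), zero);
	_mm_sfence();	/* make the NT stores visible before the page is used */
}

An architecture without an equivalent of these stores would drag the whole
1GB through its cache while zeroing it.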
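And the fsync() side of it: flushing a 1GB page one cache line at a time
looks roughly like the sketch below (64-byte cache lines assumed), i.e.
2^30 / 64 = 16M flush operations per fsync():

#include <emmintrin.h>	/* _mm_clflush(), _mm_mfence() */
#include <stddef.h>

#define CACHELINE_SIZE	64	/* assumed; real code would get this from CPUID */

/* Write back and invalidate every cache line in [buf, buf + len). */
static void flush_range(void *buf, size_t len)
{
	char *p = buf;
	size_t i;

	for (i = 0; i < len; i += CACHELINE_SIZE)
		_mm_clflush(p + i);
	_mm_mfence();	/* CLFLUSH is only ordered by MFENCE */
}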
								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR