Re: [v4 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> · Thu, 24 Oct 2019 09:33:11 -0700

On 10/24/19 6:55 AM, Matthew Wilcox wrote:
On Thu, Oct 24, 2019 at 05:19:35AM +0800, Yang Shi wrote:
We have a use case that uses tmpfs as the QEMU memory backend, and we
would like to take advantage of THP as well.  But our tests show that the
EPT is not PMD mapped even though the underlying THPs are PMD mapped on
the host.  The number shown by /sys/kernel/debug/kvm/largepages is much
smaller than the number of PMD-mapped shmem pages, as shown below:

7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
Size:            4194304 kB
[snip]
AnonHugePages:         0 kB
ShmemPmdMapped:   579584 kB
[snip]
Locked:                0 kB

cat /sys/kernel/debug/kvm/largepages
12

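With 2 MB PMD mappings, 579584 kB of ShmemPmdMapped is 283 huge pages on
the host, but KVM reports only 12 large pages.  Some benchmarks also do
worse than with anonymous THPs.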

By digging into the code, we found that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks whether
there is a single PTE mapping on the page for anonymous THP when
setting up the EPT map.  But the _mapcount < 0 check does not fit page
cache THP, since every subpage of a page cache THP gets its _mapcount
incremented once the THP is PMD mapped, so PageTransCompoundMap() always
returns false for page cache THP.  This prevents KVM from setting up
PMD-mapped EPT entries.

So we need to handle page cache THP correctly.  However, when a page
cache THP's PMD gets split, the kernel just removes the mapping instead
of setting up PTE mappings as it does for anonymous THP.  Before KVM
calls get_user_pages(), the subpages may get PTE mapped even though the
page is still a THP, since the page cache THP may be mapped by other
processes in the meantime.

So check the page's _mapcount and whether the THP is PTE mapped or not.
Although this may report some false negatives (PTE mapped by other
processes), it does not look trivial to make this accurate.

I don't understand why you care how it's mapped into userspace.  If there
is a PMD-sized page in the page cache, then you can use a PMD mapping
in the EPT tables to map it.  Why would another process having a PTE
mapping on the page cause you to not use a PMD mapping?

We don't care whether the THP is PTE mapped by another process, but
neither the PageDoubleMap flag nor _mapcount/compound_mapcount can tell
us whether the PTE mapping comes from the current process or from another
process, unless gup could return the PMD's status.
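
To make the counting problem concrete, here is a rough user-space model,
not kernel code.  All toy_* names are made up for illustration, and the
"file aware" check (subpage count equal to the compound count) is only
one assumed way to express "PMD mapped and not PTE mapped"; the real
patch may do it differently.  The counters start at -1 as in the kernel,
and they are per page, not per process, which is why a PTE mapping from
any process makes the check fail:

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_SUBPAGES 512	/* 2 MB THP / 4 KB base pages */

/* Toy stand-in for a compound page; NOT the kernel's struct page. */
struct toy_thp {
	bool is_anon;			/* anonymous vs. page cache (shmem) */
	int compound_mapcount;		/* starts at -1, +1 per PMD mapping */
	int mapcount[TOY_NR_SUBPAGES];	/* per subpage, starts at -1 */
};

static void toy_thp_init(struct toy_thp *thp, bool is_anon)
{
	thp->is_anon = is_anon;
	thp->compound_mapcount = -1;
	for (size_t i = 0; i < TOY_NR_SUBPAGES; i++)
		thp->mapcount[i] = -1;
}

/* Some process (we do not record which) maps the THP with one PMD. */
static void toy_map_pmd(struct toy_thp *thp)
{
	thp->compound_mapcount++;
	/* Page cache THP: every subpage count is bumped as well. */
	if (!thp->is_anon)
		for (size_t i = 0; i < TOY_NR_SUBPAGES; i++)
			thp->mapcount[i]++;
}

/* Some process (again unknown) maps a single subpage with a PTE. */
static void toy_map_pte(struct toy_thp *thp, size_t idx)
{
	thp->mapcount[idx]++;
}

/* Check in the style of the anonymous THP fix: no PTE mapping at all. */
static bool toy_check_anon_style(const struct toy_thp *thp, size_t idx)
{
	return thp->mapcount[idx] < 0;
}

/* Assumed variant that also accepts a PMD-mapped page cache THP. */
static bool toy_check_file_aware(const struct toy_thp *thp, size_t idx)
{
	if (thp->is_anon)
		return thp->mapcount[idx] < 0;
	/* PMD mapped at least once, and no extra PTE mapping on top. */
	return thp->compound_mapcount >= 0 &&
	       thp->mapcount[idx] == thp->compound_mapcount;
}
```

In this model a PMD-mapped shmem THP fails the anonymous-style check but
passes the file-aware one, and a single toy_map_pte() on a subpage makes
the file-aware check fail for that subpage regardless of which process
did it.  That is the false negative mentioned in the patch description.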

I think commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in
KVM with THP enabled") explains the trade-off clearly (this is not the
full commit log, just the most relevant part):

    Ideally instead of the page->_mapcount < 1 check, get_user_pages()
    should return the granularity of the "page" mapping in the "mm" passed
    to get_user_pages().  However it's non trivial change to pass the "pmd"
    status belonging to the "mm" walked by get_user_pages up the stack (up
    to the caller of get_user_pages).  So the fix just checks if there is
    not a single pte mapping on the page returned by get_user_pages, and in
    turn if the caller can assume that the whole compound page is mapped in
    the current "mm" (in a pmd_trans_huge()).  In such case the entire
    compound page is safe to map into the secondary MMU without additional
    get_user_pages() calls on the surrounding tail/head pages.  In addition
    of being faster, not having to run other get_user_pages() calls also
    reduces the memory footprint of the secondary MMU fault in case the pmd
    split happened as result of memory pressure.