Re: [PATCH] thp: tail page refcounting fix #6

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Fri, 30 Sep 2011 15:58:58 +0200

Hi everyone,

On Fri, Sep 23, 2011 at 05:57:12PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-08 at 18:51 +0200, Andrea Arcangeli wrote:
> 
> > +++ b/arch/powerpc/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> 
> lacking a diffstat a quick look seems to suggest you missed a few:
> 
> $ ls arch/*/mm/gup.c
> arch/powerpc/mm/gup.c  
> arch/s390/mm/gup.c  
> arch/sh/mm/gup.c  
> arch/sparc/mm/gup.c  
> arch/x86/mm/gup.c

sh should be already ok, sparc too as they don't seem to support
hugetlbfs or THP. s390 32bit probably too. sh actually has some
HUGETLB optional thing but I don't see it handling it in gup...

But I think I missed s39064bit! But after a closer look surprisingly
they weren't ok before this change too! Even ppc.

In fact they will work better with this change applied because
succeeding a page_cache_get_speculative on a TailPage (like it could
have happened before) would have resulted in a VM_BUG_ON before
returning from page_cache_get_speculative. I guess not many are
testing O_DIRECT on hugetlbfs on ppc64,s39064bit. The PageTail check
in powerpc/mm/gup.c suddenly looks like a noop. Weird.

Also while doing this I found longstanding bugs in powerpc. nr
includes also pages found before calling gup_hugepte. So if that race
triggers (mremap,munmap changing the huge pagetable under gup_fast,
which can definitely happen) it'd lead to put_page being called too
many times on the hugetlbfs page.

Also in the below code probably they intended to use "head", not
"page".... "page" is out of bounds... so the rollback in the race case
was going to free a random page even in older kernels.

       if (unlikely(pte_val(pte) != pte_val(*ptep))) {
               /* Could be optimized better */
               while (*nr) {
                       put_page(page);

But I'm not sure why ppc always checks if the pte changed and tries to
rollback. There's no need of that, if gup_fast could run it means
we're not within a mmu_notifier_invalidate_range_start/end critical
section, and mmu_notifier_invalidate_page won't run until after the
tlb flush returned (so after gup_fast returned on the other cpu), and
so the pages are guaranteed to always have page_count >= 0
(munmap/mremap will stop and wait in
mmu_notifier_invalidate_range_start or in mmu_notifier_invalidate_page
where the IPI delivery will wait before releasing the page count, to
flush any secondary mapping before the primary mapping is allowed to
free the page).

In short there are two fishy things in ppc code (in addition to the
race-rollback corrupting memory which is just a minor implementation
bug trivial to correct, besides after THP we've to put_page the right
subpage not just the head so it must be refactored anyway):

1) page_cache_get/add_speculative is not needed anywhere in gup_fast
path of ppc as long as irqs are disabled (it simulates the cpu doing a tlb
miss). Special care has to be taken so you read the ptep to a local
variable (stored in cpu register or kernel stack), and then you
evaluate the local pte_t (stopping reading from the pointer). Then you
do pte_page on the _local_ pte_t variable, and you know you can
get_page without doing any get_page_unless_zero (plus there's a
VM_BUG_ON in get_page to verify we're not running get_page on a page
with a zero count).

Hmmm explanation of point 1) above self-remind that maybe even x86
should always use ACCESS_ONCE, even for the pmds (it already does for
the pte, either that or it has a smp_rmb() after finish reading it),
(probably not required for upper layers as they can't be tear down
until the IPI runs), now there's no way gcc is stupid enough to
re-read from pointer but in theory it could without barrier(). This
isn't just for THP but for hugetlbfs too, it's only theoretical though.

2) the above pte_val(local_pte_t) != pte(*ptep) checks in theory as
useless for ppc, as long as you verify the pte is ok (so you arrived
reading before mremap/munmap changed the pte), the page can't go away
under you because the tlb flush will wait.

I didn't start yet checking s390 in detail but it looks close to ppc
code. Let's sort out ppc first.

Now I'd like to know if the soft tlb miss handler of powerpc changes
something and really requires the code commented point 1) and 2) (even
if that code and race checks are not required on x86 for the reason I
just mentioned...).

I can make a patch to try to keep these page_cache_get/add_speculative
and the pte_val(local_pte_t) != pte(*ptep) checks intact (they're
superfluous but they can't hurt obviously). So I could make a patch
that works with current code, but if I could safely delete those I'd
prefer as they're quite confusing on why they're needed. But before I
do that I need ack from ppc people to be sure it's ok.. I don't know
the assembly of the tlb miss hashtable handler to be sure... I don't
exclude ppc has different ipi/irq semantics for the gup_fast case and
really requires those checks.

CC'ed Benjamin :)

Comments welcome!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>