Re: cma_alloc(), add sleep-and-retry for temporary page pinning

cgoldswo@xxxxxxxxxxxxxx · Thu, 20 Aug 2020 21:17:28 -0700

On 2020-08-11 15:20, cgoldswo@xxxxxxxxxxxxxx wrote:
On 2020-08-06 18:31, Andrew Morton wrote:
On Wed,  5 Aug 2020 19:56:21 -0700 Chris Goldsworthy
<cgoldswo@xxxxxxxxxxxxxx> wrote:

On mobile devices, failure to allocate from a CMA area constitutes a
functional failure.  Sometimes during CMA allocations, we have 
observed
that pages in a CMA area allocated through alloc_pages(), that we're 
trying
to migrate away to make room for a CMA allocation, are temporarily 
pinned.
This temporary pinning can occur when a process that owns the pinned 
page
is being forked (the example is explained further in the commit 
text).
This patch addresses this issue by adding a sleep-and-retry loop in
cma_alloc() . There's another example we know of similar to the above 
that
occurs during exit_mmap() (in zap_pte_range() specifically), but I 
need to
determine if this is still relevant today.

Sounds fairly serious but boy, we're late for 5.9.

I can queue it for 5.10 with a cc:stable so that it gets backported
into earlier kernels a couple of months from now, if we think the
seriousness justifies backporting(?).

Queuing this seems like the best way to proceed, if we were to pick up
this patch.
I think we can forgo back-porting this, as this is something that will 
only be
needed as vendors such as our selves start using Google's Generic 
Kernel Image
(we've carried this patch in our tree for over four years).

And...  it really is a sad little patch, isn't it?  Instead of fixing
the problem, it reduces the problem's probability by 5x.  Can't we do
better than this?

I have one alternative in mind.  I have been able to review the 
exit_mmap()
case, so before proceeding, let's do a breakdown of the problem: we can
categorize the pinning issue we're trying to address here as being one 
of
(1) incrementing _refcount and getting context-switched out before
incrementing _mapcount (applies to forking a process / copy_one_pte()), 
and
(2) decrementing _mapcount and getting context-switched out before
decrementing _refcount (applies to tearing down a process / 
exit_mmap()).
So, one alternative would be to insert preempt_disable/enable() calls 
at
affected sites. So, for the copy_one_pte() pinning case, we could do 
the
following inside of copy_one_pte():

        if (page) {
+               preempt_disable();
                get_page(page);
                page_dup_rmap(page, false);
+               preempt_enable();
                rss[mm_counter(page)]++;
        }

I'm not sure if this approach would be acceptable for the exit_mmap()
pinning case (applicable when CONFIG_MMU_GATHER_NO_GATHER=y).  For the
purposes of this discussion, we can look at two function calls inside 
of
exit_mmap(), in the order they're called in, to show how the pinning is
occuring:

    1. Calling unmap_vmas(): this unmaps the pages in each VMA for an
    exiting task, using zap_pte_range() - zap_pte_range() reduces the
    _mapcount for each page in a VMA, using page_remove_rmap().  After
    calling page_remove_rmap(), the page is placed into a list in
    __tlb_remove_page().  This list of pages will be used when flushing
    TLB entries later on during the process teardown.

    2. Calling tlb_finish_mmu(): This is will flush the TLB entries
    associated with pages, before calling put_page() on them, using the
    previously collected pages from __tlb_remove_page() - the call flow 
is
    tlb_flush_mmu() > tlb_flush_mmu() > tlb_flush_mmu_free()
    > tlb_batch_pages_flush() > free_pages_and_swap_cache() >
    release_pages(), where release_pages() is described as a "batched
    put_page()"

The preempt_disable/enable() approach would entail doing the following
inside of exit_mmap():

+       preempt_disable();
        unmap_vmas(&tlb, vma, 0, -1);
        free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 
USER_PGTABLES_CEILING);
        tlb_finish_mmu(&tlb, 0, -1);
+       preempt_enable();

I'm not sure doing this is feasible, given how long it could take to do 
the
process teardown.

The good thing about this patch is that it has been stable in our 
kernel
for four years (though for some SoCs we increased the retry counts).  
One
thing to stress is that there are other instances of CMA page pinning, 
that
this patch isn't attempting to address. Please let me know if you're 
okay
with queuing this for the 5.10 merge window - if you are, I can add an
option to configure the number of retries, and will resend the patch 
once
the 5.9 merge window closes.

Thanks,

Chris.

Hi Andrew,

Have you been able to give the patch any further consideration?

Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,
a Linux Foundation Collaborative Project