On 2020-09-28 22:59, Christoph Hellwig wrote:
On Mon, Sep 28, 2020 at 01:30:27PM -0700, Chris Goldsworthy wrote:
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being
greater than _mapcount for that page can cause pinning for anonymous
pages. This is because try_to_unmap(), which (1) is called in the CMA
allocation path, and (2) decrements both _refcount and _mapcount for a
page, will stop unmapping a page from VMAs once the _mapcount for that
page reaches 0. This implies that after try_to_unmap() has finished
successfully for a page where _refcount > _mapcount, _refcount will
still be greater than 0. Later in the CMA allocation path, in
migrate_page_move_mapping(), we will have one more reference count than
intended for anonymous pages, so the allocation will fail for that page.
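For illustration, the check that then trips is essentially a refcount
comparison in migrate_page_move_mapping(); the sketch below paraphrases
the anonymous-page path (names and context are simplified, not the
literal kernel source):

#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Simplified paraphrase of the anonymous-page path in
 * migrate_page_move_mapping(): if the page carries even one reference
 * beyond what the migration code can account for (for example a
 * reference still held while a TLB flush is pending), the counts
 * disagree and migration of that page is refused.
 */
static int anon_page_ready_to_migrate(struct page *page, int expected_count)
{
	if (page_count(page) != expected_count)
		return -EAGAIN;	/* extra reference: page is effectively pinned */
	return 0;		/* safe to move this page */
}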
If a process ends up causing _refcount > _mapcount for a page (by
either incrementing _refcount or decrementing _mapcount), such that the
process is context switched out after modifying one count but before
modifying the other, the page will be temporarily pinned.
One example of where _refcount can be greater than _mapcount is inside
of zap_pte_range(), which is called for all the entries of a PMD when a
process is exiting, to unmap the process's memory. Inside of
zap_pte_range(), after unmapping a page with page_remove_rmap(), we
have _refcount > _mapcount for that page. _refcount can only be
decremented after a TLB flush is performed for the page - this doesn't
occur until enough pages have been batched together for flushing. The
flush can either occur inside of zap_pte_range() (during the same
invocation or a later one), or, if there aren't enough pages collected
by the time we unmap all of the pages in a process, the flush will
occur in tlb_finish_mmu() in exit_mmap(). After the flush has occurred,
tlb_batch_pages_flush() will decrement the references on the flushed
pages.
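Condensed, the ordering inside zap_pte_range() that opens this window
looks roughly like the sketch below (heavily simplified, and the helper
name is made up for illustration - this is not the literal kernel code):

#include <linux/mm.h>
#include <linux/rmap.h>
#include <asm/tlb.h>

/*
 * Heavily condensed sketch of the window described above: _mapcount is
 * dropped as soon as the rmap is torn down, but the page's _refcount is
 * only dropped later, when the gathered batch is flushed (either within
 * zap_pte_range() itself or in tlb_finish_mmu()).
 */
static void zap_one_pte_sketch(struct mmu_gather *tlb,
			       struct vm_area_struct *vma,
			       pte_t *ptep, unsigned long addr)
{
	pte_t ptent = ptep_get_and_clear_full(vma->vm_mm, addr, ptep,
					      tlb->fullmm);
	struct page *page = vm_normal_page(vma, addr, ptent);

	if (!page)
		return;

	page_remove_rmap(page, false);	/* _mapcount drops here */
	/*
	 * The page is only queued here; its _refcount is not dropped until
	 * the TLB flush releases the batch, so _refcount > _mapcount until
	 * then.
	 */
	__tlb_remove_page(tlb, page);
}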
Another such example is inside of copy_one_pte(), which is called
during a fork. For PTEs for which pte_present(pte) == true,
copy_one_pte() will increment the _refcount field followed by the
_mapcount field of a page.
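The relevant ordering in copy_one_pte() is roughly the following
(paraphrased from mm/memory.c of this era, so treat it as an
approximation rather than the exact hunk):

	/*
	 * Paraphrased from copy_one_pte(): the child's reference is taken
	 * before the child's rmap is added, so for a moment
	 * _refcount > _mapcount on the shared page.
	 */
	if (page) {
		get_page(page);			/* _refcount goes up first */
		page_dup_rmap(page, false);	/* ...then _mapcount catches up */
	}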
So, inside of cma_alloc(), add the option of letting users pass in
__GFP_NOFAIL to indicate that we should retry CMA allocations
indefinitely, in the event that alloc_contig_range() returns -EBUSY
after having scanned a whole CMA-region bitmap.
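The shape of the retry condition is roughly the following; the helper
below is only an illustration (its name and parameters are not taken
from the posted patch):

#include <linux/errno.h>
#include <linux/gfp.h>

/*
 * Illustrative helper (not from the patch): only when a full pass over
 * the CMA bitmap produced nothing but -EBUSY, and the caller asked for
 * __GFP_NOFAIL, should cma_alloc() restart the scan from the beginning
 * instead of giving up.
 */
static bool cma_should_retry(int ret, unsigned long next_start,
			     unsigned long bitmap_maxno, gfp_t gfp_mask)
{
	return ret == -EBUSY && next_start >= bitmap_maxno &&
	       (gfp_mask & __GFP_NOFAIL);
}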
And who is going to use this? As-is this just seems to add code that
isn't actually used and thus isn't actually tested. (In addition to
being a really bad idea, as discussed before.)
Hi Christoph,
That had slipped my mind - what we would have submitted would have been
a modified /drivers/dma-heap/heaps/cma_heap.c, which would have created
a "linux,cma-nofail" heap that, when allocated from, passes __GFP_NOFAIL
to cma_alloc(). But since this retry approach (finite and infinite) has
effectively been nacked, I've gone back to the drawing board to find
either (1) a lock-based approach to solving this (as posed by Andrew
Morton here: https://lkml.org/lkml/2020/8/21/1490), or (2) using
preempt_disable() calls.
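To make the intent concrete, the heap-side hook would have looked
something like the fragment below (hypothetical, since it was never
posted; the nofail field is made up for the sketch):

	/*
	 * Hypothetical sketch of the never-posted cma_heap change: a heap
	 * registered from a "linux,cma-nofail" region would forward
	 * __GFP_NOFAIL so that cma_alloc() keeps retrying instead of
	 * returning NULL.
	 */
	cma_pages = cma_alloc(cma_heap->cma, nr_pages, align,
			      cma_heap->nofail ? __GFP_NOFAIL : 0);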
Thanks,
Chris.
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -196,7 +196,7 @@ struct page *dma_alloc_from_contiguous(struct device *dev, size_t count,
 	if (align > CONFIG_CMA_ALIGNMENT)
 		align = CONFIG_CMA_ALIGNMENT;
-	return cma_alloc(dev_get_cma_area(dev), count, align, no_warn);
+	return cma_alloc(dev_get_cma_area(dev), count, align, no_warn ? __GFP_NOWARN : 0);
Also don't add pointlessly overlong lines.
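For reference, wrapping the call keeps it within the usual 80 columns,
e.g. (an illustration of the style point only, not a posted fixup):

	return cma_alloc(dev_get_cma_area(dev), count, align,
			 no_warn ? __GFP_NOWARN : 0);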
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project