On Mon, Sep 28, 2020 at 01:30:27PM -0700, Chris Goldsworthy wrote:
> CMA allocations will fail if 'pinned' pages are in a CMA area, since we
> cannot migrate pinned pages.  The _refcount of a struct page being
> greater than _mapcount for that page can cause pinning for anonymous
> pages.  This is because try_to_unmap(), which (1) is called in the CMA
> allocation path, and (2) decrements both _refcount and _mapcount for a
> page, will stop unmapping a page from VMAs once the _mapcount for a
> page reaches 0.  This implies that after try_to_unmap() has finished
> successfully for a page where _refcount > _mapcount, _refcount will be
> greater than 0.  Later in the CMA allocation path, in
> migrate_page_move_mapping(), we will have one more reference count
> than intended for anonymous pages, meaning the allocation will fail
> for that page.
>
> If a process ends up causing _refcount > _mapcount for a page (by
> either incrementing _refcount or decrementing _mapcount), such that
> the process is context switched out after modifying one count but
> before modifying the other, the page will be temporarily pinned.
>
> One example of where _refcount can be greater than _mapcount is inside
> of zap_pte_range(), which is called for all the entries of a PMD when
> a process is exiting, to unmap the process's memory.  Inside of
> zap_pte_range(), after unmapping a page with page_remove_rmap(), we
> have _refcount > _mapcount.  _refcount can only be decremented after a
> TLB flush is performed for the page - this doesn't occur until enough
> pages have been batched together for flushing.  The flush can either
> occur inside of zap_pte_range() (during the same invocation or a later
> one), or if there aren't enough pages collected by the time we unmap
> all of the pages in a process, the flush will occur in
> tlb_finish_mmu() in exit_mmap().  After the flush has occurred,
> tlb_batch_pages_flush() will decrement the references on the flushed
> pages.
>
> Another such example is inside of copy_one_pte(), which is called
> during a fork.  For PTEs for which pte_present(pte) == true,
> copy_one_pte() will increment the _refcount field followed by the
> _mapcount field of a page.
>
> So, inside of cma_alloc(), add the option of letting users pass in
> __GFP_NOFAIL to indicate that we should retry CMA allocations
> indefinitely, in the event that alloc_contig_range() returns -EBUSY
> after having scanned a whole CMA-region bitmap.

And who is going to use this?  As-is this just seems to add code that
isn't actually used and thus not actually tested.  (In addition to
being a really bad idea, as discussed before.)

> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -196,7 +196,7 @@ struct page *dma_alloc_from_contiguous(struct device *dev, size_t count,
>  	if (align > CONFIG_CMA_ALIGNMENT)
>  		align = CONFIG_CMA_ALIGNMENT;
>  
> -	return cma_alloc(dev_get_cma_area(dev), count, align, no_warn);
> +	return cma_alloc(dev_get_cma_area(dev), count, align, no_warn ? __GFP_NOWARN : 0);

Also don't add pointlessly overlong lines.
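The call fits in 80 columns if the last argument is simply wrapped onto
its own line, e.g.:

	return cma_alloc(dev_get_cma_area(dev), count, align,
			 no_warn ? __GFP_NOWARN : 0);

For reference, what the proposed __GFP_NOFAIL behaviour amounts to
inside cma_alloc() is roughly the loop below.  This is only a sketch of
the idea: the gfp_mask parameter and the __cma_alloc_scan() helper are
placeholders for illustration, not the code actually posted in this
series.

/*
 * Illustrative only: retry as long as alloc_contig_range() keeps
 * returning -EBUSY and the caller passed __GFP_NOFAIL.
 * __cma_alloc_scan() stands in for one full scan of the CMA bitmap.
 */
struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
			gfp_t gfp_mask)
{
	struct page *page;
	int ret;

	do {
		ret = __cma_alloc_scan(cma, count, align, &page);
		if (ret == -EBUSY && (gfp_mask & __GFP_NOFAIL))
			/* whole bitmap was busy: back off, then retry */
			schedule_timeout_killable(msecs_to_jiffies(100));
	} while (ret == -EBUSY && (gfp_mask & __GFP_NOFAIL));

	return ret ? NULL : page;
}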