On Wed, Dec 02, 2020 at 10:14:41AM +0100, David Hildenbrand wrote:
> On 01.12.20 18:51, Minchan Kim wrote:
> > There is a need for special HW to require bulk allocation of
> > high-order pages. For example, 4800 * order-4 pages, which
> > would be the minimum; sometimes it requires more.
> > 
> > To meet the requirement, one option reserves a 300M CMA area and
> > requests the whole 300M as contiguous memory. However, that doesn't
> > work if even one of the pages in the range is long-term pinned,
> > directly or indirectly. The other option is to ask for a higher-order
> 
> My latest knowledge is that pages in the CMA area are never long term
> pinned.
> 
> https://lore.kernel.org/lkml/20201123090129.GD27488@xxxxxxxxxxxxxx/
> 
> "gup already tries to deal with long term pins on CMA regions and migrate
> to a non CMA region. Have a look at __gup_longterm_locked."
> 
> We should rather identify ways how that is still possible and get rid of
> them.
> 
> 
> Now, short-term pinnings and PCP are other issues where
> alloc_contig_range() could be improved (e.g., in contrast to a FAST
> mode, a HARD mode which temporarily disables the PCP, ...).

GUP is just one of the cases that makes CMA allocation fail directly;
there are still a bunch of cases that do so indirectly, where a
short-term pin turns into a long-term one for several reasons (CPU
scheduling, IO scheduling, memory pressure, locking dependencies).
In the end those pins are released, but that is unreliable and slow,
so the idea is to simply skip such a page and continue searching the
other pages in the CMA area, because what we need is just several
high-order pages, not one big 300M chunk.

> 
> > size (e.g., 2M) than the requested order (64K) repeatedly until the driver
> > could gather the necessary amount of memory. Basically, this approach
> > makes the allocation very slow due to cma_alloc's function
> > slowness, and it could get stuck on one of the pageblocks if it
> > encounters an unmigratable page.
> > 
> > To solve the issue, this patch introduces cma_alloc_bulk.
> > 
> > 	int cma_alloc_bulk(struct cma *cma, unsigned int align,
> > 		bool fast, unsigned int order, size_t nr_requests,
> > 		struct page **page_array, size_t *nr_allocated);
> > 
> > Most parameters are the same as cma_alloc, but it additionally takes
> > a vector array to store the allocated memory. What's different from
> > cma_alloc is that it will skip a pageblock without waiting/stopping
> > if the pageblock has an unmovable page, so the API continues to scan
> > other pageblocks to find pages of the requested order.
> > 
> > cma_alloc_bulk is a best-effort approach in that, unlike cma_alloc,
> > it skips pageblocks that have unmovable pages. It doesn't need to be
> > perfect from the beginning, at the cost of performance. Thus, the API
> > takes a "bool fast" parameter, which is propagated into
> > alloc_contig_range to avoid functions with significant overhead that
> > exist to increase the CMA allocation success ratio (e.g., migration
> > retries, PCP and LRU draining per pageblock), at the cost of a lower
> > allocation success ratio. If the caller couldn't allocate enough,
> > they can call it again with "false" to increase the success ratio,
> > if they are okay with paying the overhead for it.
> 
> Just so I understand what the idea is:
> 
> alloc_contig_range() sometimes fails on CMA regions when trying to
> allocate big chunks (e.g., 300M). Instead of tackling that issue, you
> rather allocate plenty of small chunks, and make these small allocations
> fail faster / make the allocations less reliable. Correct?

Please see the explanation above.
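
To make the expected usage concrete, here is a rough sketch of what the
driver side could look like. The numbers come from the changelog; the
variable names, the kvmalloc_array bookkeeping and the error handling are
only illustrative, not part of this patch:

	/*
	 * Illustrative only: gather 4800 order-4 pages (300M with 4K
	 * pages) from a private CMA area, trying the fast mode first
	 * and falling back to the normal mode for whatever is missing.
	 */
	struct page **pages;
	size_t got = 0, total = 0;
	int ret;

	pages = kvmalloc_array(4800, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* fast pass: no draining, single migration try, may return less */
	ret = cma_alloc_bulk(cma, 0, true, 4, 4800, pages, &got);
	total = got;

	if (total < 4800) {
		/* slow pass: pay the drain/retry cost only for the rest */
		ret = cma_alloc_bulk(cma, 0, false, 4, 4800 - total,
					pages + total, &got);
		total += got;
	}

	if (total < 4800) {
		/* regardless of ret, release whatever was handed back */
		while (total)
			cma_release(cma, pages[--total], 1 << 4);
		kvfree(pages);
		return -ENOMEM;
	}

With this, the expensive draining and retries are paid only for the part
of the request that the cheap pass could not satisfy.
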
> 
> I don't really have a strong opinion on that. Giving up fast rather than
> trying for longer sounds like a useful thing to have - but I wonder if
> it's strictly necessary for the use case you describe.
> 
> I'd like to hear Michal's opinion on that.
> 
> > 
> > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> >  include/linux/cma.h |   5 ++
> >  include/linux/gfp.h |   2 +
> >  mm/cma.c            | 126 ++++++++++++++++++++++++++++++++++++++++++--
> >  mm/page_alloc.c     |  19 ++---
> >  4 files changed, 140 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/cma.h b/include/linux/cma.h
> > index 217999c8a762..7375d3131804 100644
> > --- a/include/linux/cma.h
> > +++ b/include/linux/cma.h
> > @@ -46,6 +46,11 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> >  					struct cma **res_cma);
> >  extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >  			      bool no_warn);
> > +
> > +extern int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
> > +			unsigned int order, size_t nr_requests,
> > +			struct page **page_array, size_t *nr_allocated);
> > +
> >  extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
> > 
> >  extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index ad5872699692..75bfb673d75b 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -627,6 +627,8 @@ static inline bool pm_suspended_storage(void)
> >  enum alloc_contig_mode {
> >  	/* try several ways to increase success ratio of memory allocation */
> >  	ALLOC_CONTIG_NORMAL,
> > +	/* avoid costly functions to make the call fast */
> > +	ALLOC_CONTIG_FAST,
> >  };
> > 
> >  /* The below functions must be run on a range from a single zone. */
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 8010c1ba04b0..4459045fa717 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/highmem.h>
> >  #include <linux/io.h>
> >  #include <linux/kmemleak.h>
> > +#include <linux/swap.h>
> >  #include <trace/events/cma.h>
> > 
> >  #include "cma.h"
> > @@ -397,6 +398,14 @@ static void cma_debug_show_areas(struct cma *cma)
> >  static inline void cma_debug_show_areas(struct cma *cma) { }
> >  #endif
> > 
> > +static void reset_page_kasan_tag(struct page *page, int count)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < count; i++)
> > +		page_kasan_tag_reset(page + i);
> > +}
> > +
> >  /**
> >   * cma_alloc() - allocate pages from contiguous area
> >   * @cma:   Contiguous memory region for which the allocation is performed.
> > @@ -414,7 +423,6 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >  	unsigned long pfn = -1;
> >  	unsigned long start = 0;
> >  	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> > -	size_t i;
> >  	struct page *page = NULL;
> >  	int ret = -ENOMEM;
> > 
> > @@ -479,10 +487,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >  	 * blocks being marked with different tags. Reset the tags to ignore
> >  	 * those page blocks.
> >  	 */
> > -	if (page) {
> > -		for (i = 0; i < count; i++)
> > -			page_kasan_tag_reset(page + i);
> > -	}
> > +	if (page)
> > +		reset_page_kasan_tag(page, count);
> > 
> >  	if (ret && !no_warn) {
> >  		pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n",
> > @@ -494,6 +500,116 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >  	return page;
> >  }
> > 
> > +/*
> > + * cma_alloc_bulk() - allocate high-order pages in bulk from a contiguous
> > + *		area on a best-effort basis. It will usually be used for a
> > + *		private @cma.
> > + *
> > + * @cma:   contiguous memory region for which the allocation is performed.
> > + * @align: requested alignment of pages (in PAGE_SIZE order).
> > + * @fast:  will skip costly operations if it's true.
> > + * @order: requested page order
> > + * @nr_requests: the number of 2^order pages requested to be allocated as input,
> > + * @page_array: page_array pointer to store allocated pages (must have space
> > + *		for at least nr_requests)
> > + * @nr_allocated: the number of 2^order pages allocated as output
> > + *
> > + * This function tries to allocate up to @nr_requests @order pages on the
> > + * specified contiguous memory area. If @fast is true, it will avoid costly
> > + * functions that increase the allocation success ratio, so it will be faster
> > + * but might return fewer pages than requested. The user can retry with
> > + * @fast set to false if needed.
> > + *
> > + * Return: it will return 0 only if all pages requested by @nr_requests are
> > + * allocated. Otherwise, it returns a negative error code.
> > + *
> > + * Note: Regardless of success/failure, the user should check @nr_allocated to
> > + * see how many @order pages were allocated and free those pages when they are
> > + * not needed.
> > + */
> > +int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
> > +		unsigned int order, size_t nr_requests,
> > +		struct page **page_array, size_t *nr_allocated)
> > +{
> > +	int ret = 0;
> > +	size_t i = 0;
> > +	unsigned long nr_pages_needed = nr_requests * (1 << order);
> > +	unsigned long nr_chunk_pages, nr_pages;
> > +	unsigned long mask, offset;
> > +	unsigned long pfn = -1;
> > +	unsigned long start = 0;
> > +	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> > +	struct page *page = NULL;
> > +	enum alloc_contig_mode mode = fast ? ALLOC_CONTIG_FAST :
> > +						ALLOC_CONTIG_NORMAL;
> > +	*nr_allocated = 0;
> > +	if (!cma || !cma->count || !cma->bitmap || !page_array)
> > +		return -EINVAL;
> > +
> > +	if (!nr_pages_needed)
> > +		return 0;
> > +
> > +	nr_chunk_pages = 1 << max_t(unsigned int, order, pageblock_order);
> > +
> > +	mask = cma_bitmap_aligned_mask(cma, align);
> > +	offset = cma_bitmap_aligned_offset(cma, align);
> > +	bitmap_maxno = cma_bitmap_maxno(cma);
> > +
> > +	lru_add_drain_all();
> > +	drain_all_pages(NULL);
> > +
> > +	while (nr_pages_needed) {
> > +		nr_pages = min(nr_chunk_pages, nr_pages_needed);
> > +
> > +		bitmap_count = cma_bitmap_pages_to_bits(cma, nr_pages);
> > +		mutex_lock(&cma->lock);
> > +		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> > +				bitmap_maxno, start, bitmap_count, mask,
> > +				offset);
> > +		if (bitmap_no >= bitmap_maxno) {
> > +			mutex_unlock(&cma->lock);
> > +			break;
> > +		}
> > +		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> > +		/*
> > +		 * It's safe to drop the lock here. If the migration fails
> > +		 * cma_clear_bitmap will take the lock again and unmark it.
> > +		 */
> > +		mutex_unlock(&cma->lock);
> > +
> > +		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> > +		ret = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_CMA,
> > +				GFP_KERNEL|__GFP_NOWARN, mode);
> > +		if (ret) {
> > +			cma_clear_bitmap(cma, pfn, nr_pages);
> > +			if (ret != -EBUSY)
> > +				break;
> > +
> > +			/* continue to search next block */
> > +			start = (pfn + nr_pages - cma->base_pfn) >>
> > +					cma->order_per_bit;
> > +			continue;
> > +		}
> > +
> > +		page = pfn_to_page(pfn);
> > +		while (nr_pages) {
> > +			page_array[i++] = page;
> > +			reset_page_kasan_tag(page, 1 << order);
> > +			page += 1 << order;
> > +			nr_pages -= 1 << order;
> > +			nr_pages_needed -= 1 << order;
> > +		}
> > +
> > +		start = bitmap_no + bitmap_count;
> > +	}
> > +
> > +	*nr_allocated = i;
> > +
> > +	if (!ret && nr_pages_needed)
> > +		ret = -EBUSY;
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * cma_release() - release allocated pages
> >   * @cma:   Contiguous memory region for which the allocation is performed.
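
BTW, to put some numbers on the changelog example for this loop (assuming
4K base pages and pageblock_order = 9, which are only my assumptions here):
4800 order-4 pages are 300M, and each iteration above claims one 2M
pageblock, i.e. 32 order-4 chunks, so a full allocation needs 150 pageblocks
to succeed. With the fast mode below, a pageblock that turns out to be busy
costs only a single failed migration pass instead of five retries plus the
per-call draining.
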
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index adfbfd95fbc3..2a1799ff14fc 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8463,7 +8463,8 @@ static unsigned long pfn_max_align_up(unsigned long pfn)
> > 
> >  /* [start, end) must belong to a single zone. */
> >  static int __alloc_contig_migrate_range(struct compact_control *cc,
> > -					unsigned long start, unsigned long end)
> > +					unsigned long start, unsigned long end,
> > +					unsigned int max_tries)
> >  {
> >  	/* This function is based on compact_zone() from compaction.c. */
> >  	unsigned int nr_reclaimed;
> > @@ -8491,7 +8492,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> >  				break;
> >  			}
> >  			tries = 0;
> > -		} else if (++tries == 5) {
> > +		} else if (++tries == max_tries) {
> >  			ret = ret < 0 ? ret : -EBUSY;
> >  			break;
> >  		}
> > @@ -8553,6 +8554,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  	unsigned long outer_start, outer_end;
> >  	unsigned int order;
> >  	int ret = 0;
> > +	bool fast_mode = mode == ALLOC_CONTIG_FAST;
> > 
> >  	struct compact_control cc = {
> >  		.nr_migratepages = 0,
> > @@ -8595,7 +8597,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  	if (ret)
> >  		return ret;
> > 
> > -	drain_all_pages(cc.zone);
> > +	if (!fast_mode)
> > +		drain_all_pages(cc.zone);
> > 
> >  	/*
> >  	 * In case of -EBUSY, we'd like to know which page causes problem.
> > @@ -8607,7 +8610,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  	 * allocated. So, if we fall through be sure to clear ret so that
> >  	 * -EBUSY is not accidentally used or returned to caller.
> >  	 */
> > -	ret = __alloc_contig_migrate_range(&cc, start, end);
> > +	ret = __alloc_contig_migrate_range(&cc, start, end, fast_mode ? 1 : 5);
> >  	if (ret && ret != -EBUSY)
> >  		goto done;
> >  	ret =0;
> > @@ -8629,7 +8632,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  	 * isolated thus they won't get removed from buddy.
> >  	 */
> > 
> > -	lru_add_drain_all();
> > +	if (!fast_mode)
> > +		lru_add_drain_all();
> > 
> >  	order = 0;
> >  	outer_start = start;
> > @@ -8656,8 +8660,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> > 
> >  	/* Make sure the range is really isolated. */
> >  	if (test_pages_isolated(outer_start, end, 0)) {
> > -		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
> > -			__func__, outer_start, end);
> > +		if (!fast_mode)
> > +			pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
> > +				__func__, outer_start, end);
> >  		ret = -EBUSY;
> >  		goto done;
> >  	}
> > 
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 