On Wed, Nov 8, 2023 at 1:28 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> >
> > According to the current CMA utilization policy, an alloc_pages(GFP_USER)
> > could 'steal' UNMOVABLE & RECLAIMABLE page blocks with the help of
> > CMA (it passes zone_watermark_ok by counting CMA in, but uses U&R in
> > rmqueue), which could lead to subsequent alloc_pages(GFP_KERNEL) failures.
> > Solve this by introducing a second watermark check for GFP_MOVABLE,
> > which lets the allocation use CMA when appropriate.
> >
> > -- Free_pages(30MB)
> > |
> > |
> > -- WMARK_LOW(25MB)
> > |
> > -- Free_CMA(12MB)
> > |
> > |
> > --
>
> We're running into the same issue in production and had an incident
> over the weekend because of it. The hosts have a raised
> vm.min_free_kbytes for network rx reliability, which makes the
> mismatch between free pages and what's actually allocatable by regular
> kernel requests quite pronounced. It wasn't OOMing this time, but we
> saw very high rates of thrashing while CMA had plenty of headroom.
>
> I had raised the broader issue around poor CMA utilization before:
> https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@xxxxxxxxxxx/
>
> For context, we're using hugetlb_cma at several gigabytes to allow
> sharing hosts between jobs that use hugetlb and jobs that don't.
>
> > @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> >
> >  }
> >
> > +#ifdef CONFIG_CMA
> > +/*
> > + * A GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks
> > + * with the help of CMA, which can make GFP_KERNEL allocations fail. Check
> > + * zone_watermark_ok again without ALLOC_CMA to decide whether to use CMA
> > + * first.
> > + */
> > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> > +{
> > +	unsigned long watermark;
> > +	bool cma_first = false;
> > +
> > +	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > +	/* check if GFP_MOVABLE passed the previous zone_watermark_ok via the help of CMA */
> > +	if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> > +		/*
> > +		 * Balance movable allocations between regular and CMA areas by
> > +		 * allocating from CMA when over half of the zone's free memory
> > +		 * is in the CMA area.
> > +		 */

Ok, thanks for pointing that out. Could we simplify it like this? It
will misjudge a kmalloc within an ioctl as GFP_USER, but I think that
is ok as it is rare:

	if (current_is_kswapd() || !current->mm)
		gfp_flags = GFP_KERNEL;
	else
		gfp_flags = GFP_USER;
	free_pages = zone_page_state(zone, NR_FREE_PAGES);
	free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
	free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
	cma_first = zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2;

> > +		cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) >
> > +				zone_page_state(zone, NR_FREE_PAGES) / 2);
> > +	} else {
> > +		/*
> > +		 * The watermark failing means UNMOVABLE & RECLAIMABLE are not
> > +		 * enough now; we should use CMA first to keep them around the
> > +		 * corresponding watermark.
> > +		 */
> > +		cma_first = true;
> > +	}
> > +	return cma_first;
>
> I think it's a step in the right direction. However, it doesn't take
> the lowmem reserves into account. With DMA32 that can be an additional
> multiple gigabytes of "free" memory not available to GFP_KERNEL.
> It also has a knee in the balancing curve because it doesn't take
> reserves into account *until* non-CMA is depleted - at which point it
> would already be below the use-CMA threshold by the full reserves and
> watermarks.
>
> A more complete solution would have to plumb the highest_zoneidx
> information through the rmqueue family of functions somehow, and
> always take unavailable free memory into account:
>
> ---
> Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning
>  to fail
>
> We can get into a situation where kernel allocations are starting to
> fail on watermarks, but movable allocations still don't use CMA
> because non-CMA free pages make up more than half of the free memory.
> This can happen in particular with elevated vm.min_free_kbytes
> settings, where the remaining free pages aren't available to
> non-atomic requests.
>
> Example scenario:
>
>   Free:       3.0G
>   Watermarks: 2.0G
>   CMA:        1.4G
>   -> non-CMA: 1.6G
>
> CMA isn't used because CMA <= free/2. Kernel allocations fail due to
> non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon
> without swap), the kernel is more likely to OOM prematurely.
>
> Reduce the probability of that happening by taking reserves and
> watermarks into account when deciding whether to start using CMA.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
>  mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 59 insertions(+), 34 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 733732e7e0ba..b9273d7f23b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>
>  }
>
> +static bool should_try_cma(struct zone *zone, unsigned int order,
> +			   gfp_t gfp_flags, unsigned int alloc_flags)
> +{
> +	long free_pages;
> +
> +	if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA))
> +		return false;
> +
> +	/*
> +	 * CMA regions can be used by movable allocations while
> +	 * they're not otherwise in use. This is a delicate balance:
> +	 * Filling CMA too soon poses a latency risk for actual CMA
> +	 * allocations (think camera app startup). Filling CMA too
> +	 * late risks premature OOMs from non-movable allocations.
> +	 *
> +	 * Start using CMA once it dominates the remaining free
> +	 * memory. Be sure to take watermarks and reserves into
> +	 * account when considering what's truly "free".
> +	 *
> +	 * free_pages can go negative, but that's okay because
> +	 * NR_FREE_CMA_PAGES should not.
> +	 */
> +
> +	free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +	free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
> +	free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> +
> +	return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2;
> +}
> +
>  /*
>   * Do the hard work of removing an element from the buddy allocator.
>   * Call me with the zone->lock already held.
>   */
>  static __always_inline struct page *
> -__rmqueue(struct zone *zone, unsigned int order, int migratetype,
> -	  unsigned int alloc_flags)
> +__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags,
> +	  int migratetype, unsigned int alloc_flags)
>  {
>  	struct page *page;
>
> -	if (IS_ENABLED(CONFIG_CMA)) {
> -		/*
> -		 * Balance movable allocations between regular and CMA areas by
> -		 * allocating from CMA when over half of the zone's free memory
> -		 * is in the CMA area.
> -		 */
> -		if (alloc_flags & ALLOC_CMA &&
> -		    zone_page_state(zone, NR_FREE_CMA_PAGES) >
> -		    zone_page_state(zone, NR_FREE_PAGES) / 2) {
> -			page = __rmqueue_cma_fallback(zone, order);
> -			if (page)
> -				return page;
> -		}
> +	if (should_try_cma(zone, order, gfp_flags, alloc_flags)) {
> +		page = __rmqueue_cma_fallback(zone, order);
> +		if (page)
> +			return page;
>  	}
> +
>  retry:
>  	page = __rmqueue_smallest(zone, order, migratetype);
>  	if (unlikely(!page)) {
> @@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>   * a single hold of the lock, for efficiency. Add them to the supplied list.
>   * Returns the number of new pages which were placed at *list.
>   */
> -static int rmqueue_bulk(struct zone *zone, unsigned int order,
> +static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags,
>  			unsigned long count, struct list_head *list,
>  			int migratetype, unsigned int alloc_flags)
>  {
> @@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>
>  	spin_lock_irqsave(&zone->lock, flags);
>  	for (i = 0; i < count; ++i) {
> -		struct page *page = __rmqueue(zone, order, migratetype,
> -								alloc_flags);
> +		struct page *page = __rmqueue(zone, order, gfp_flags,
> +					      migratetype, alloc_flags);
>  		if (unlikely(page == NULL))
>  			break;
>
> @@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>
>  static __always_inline
>  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> -			   unsigned int order, unsigned int alloc_flags,
> -			   int migratetype)
> +			   unsigned int order, gfp_t gfp_flags,
> +			   unsigned int alloc_flags, int migratetype)
>  {
>  	struct page *page;
>  	unsigned long flags;
> @@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  		if (alloc_flags & ALLOC_HIGHATOMIC)
>  			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>  		if (!page) {
> -			page = __rmqueue(zone, order, migratetype, alloc_flags);
> +			page = __rmqueue(zone, order, gfp_flags,
> +					 migratetype, alloc_flags);
>
>  			/*
>  			 * If the allocation fails, allow OOM handling access
> @@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>  /* Remove page from the per-cpu list, caller must protect the list */
>  static inline
>  struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> -			int migratetype,
> -			unsigned int alloc_flags,
> -			struct per_cpu_pages *pcp,
> -			struct list_head *list)
> +			gfp_t gfp_flags, int migratetype,
> +			unsigned int alloc_flags,
> +			struct per_cpu_pages *pcp,
> +			struct list_head *list)
>  {
>  	struct page *page;
>
> @@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  			int batch = nr_pcp_alloc(pcp, zone, order);
>  			int alloced;
>
> -			alloced = rmqueue_bulk(zone, order,
> +			alloced = rmqueue_bulk(zone, order, gfp_flags,
>  					batch, list,
>  					migratetype, alloc_flags);
>
> @@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>
>  /* Lock and remove page from the per-cpu list */
>  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> -			struct zone *zone, unsigned int order,
> -			int migratetype, unsigned int alloc_flags)
> +			struct zone *zone, unsigned int order,
> +			gfp_t gfp_flags, int migratetype,
> +			unsigned int alloc_flags)
>  {
>  	struct per_cpu_pages *pcp;
>  	struct list_head *list;
> @@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	 */
>  	pcp->free_count >>= 1;
>  	list = &pcp->lists[order_to_pindex(migratetype, order)];
> -	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> +	page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype,
> +				 alloc_flags, pcp, list);
>  	pcp_spin_unlock(pcp);
>  	pcp_trylock_finish(UP_flags);
>  	if (page) {
> @@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone,
>
>  	if (likely(pcp_allowed_order(order))) {
>  		page = rmqueue_pcplist(preferred_zone, zone, order,
> -				migratetype, alloc_flags);
> +				gfp_flags, migratetype, alloc_flags);
>  		if (likely(page))
>  			goto out;
>  	}
>
> -	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> -			migratetype);
> +	page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags,
> +			alloc_flags, migratetype);
>
>  out:
>  	/* Separate test+clear to avoid unnecessary atomics */
> @@ -4480,8 +4505,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  			continue;
>  		}
>
> -		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
> -								pcp, pcp_list);
> +		page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype,
> +					 alloc_flags, pcp, pcp_list);
>  		if (unlikely(!page)) {
>  			/* Try and allocate at least one page */
>  			if (!nr_account) {
> --
> 2.42.0
>
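
For anyone following the thread, here is a minimal stand-alone sketch
(plain user-space C, not kernel code; the helper names old_use_cma()
and new_use_cma(), the reserve value of 0, and the hard-coded megabyte
numbers are illustrative only, taken from the example scenario in the
patch description) that plugs the scenario into the old "CMA > free/2"
check and into a reserve-aware check along the lines of
should_try_cma() above:

	/*
	 * Illustrative sketch only -- not kernel code. Numbers follow the
	 * example scenario above: 3.0G free, 2.0G watermarks, 1.4G of the
	 * free memory in CMA; lowmem reserve is taken as 0 for simplicity.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	/* Old heuristic: use CMA once it holds more than half of all free memory. */
	static bool old_use_cma(long free, long free_cma)
	{
		return free_cma > free / 2;
	}

	/*
	 * Reserve-aware heuristic: compare CMA against the free memory that
	 * is actually available to the request, i.e. after subtracting
	 * watermarks and lowmem reserves (the result may go negative, as in
	 * the patch).
	 */
	static bool new_use_cma(long free, long free_cma, long wmark, long reserve)
	{
		long free_avail = free - wmark - reserve;

		return free_cma > free_avail / 2;
	}

	int main(void)
	{
		long free = 3000, free_cma = 1400, wmark = 2000, reserve = 0; /* MB */

		/* old: 1400 <= 3000/2, so CMA stays unused -> prints "no" */
		printf("old heuristic uses CMA: %s\n",
		       old_use_cma(free, free_cma) ? "yes" : "no");
		/* new: 1400 > (3000 - 2000)/2, so CMA is used -> prints "yes" */
		printf("new heuristic uses CMA: %s\n",
		       new_use_cma(free, free_cma, wmark, reserve) ? "yes" : "no");
		return 0;
	}

With these numbers the old check keeps movable allocations out of CMA
even though the 1.6G of non-CMA free memory is already below the 2.0G
of watermarks, which is exactly the premature-OOM situation the patch
describes; the reserve-aware check flips to using CMA well before that
point.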