On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote: > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx> > > According to current CMA utilization policy, an alloc_pages(GFP_USER) > could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of > CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue), > which could lead to following alloc_pages(GFP_KERNEL) fail. > Solving this by introducing second watermark checking for GFP_MOVABLE, > which could have the allocation use CMA when proper. > > -- Free_pages(30MB) > | > | > -- WMARK_LOW(25MB) > | > -- Free_CMA(12MB) > | > | > -- We're running into the same issue in production and had an incident over the weekend because of it. The hosts have a raised vm.min_free_kbytes for network rx reliability, which makes the mismatch between free pages and what's actually allocatable by regular kernel requests quite pronounced. It wasn't OOMing this time, but we saw very high rates of thrashing while CMA had plenty of headroom. I had raised the broader issue around poor CMA utilization before: https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@xxxxxxxxxxx/ For context, we're using hugetlb_cma at several gigabytes to allow sharing hosts between jobs that use hugetlb and jobs that don't. > @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, > > } > > +#ifdef CONFIG_CMA > +/* > + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via > + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok > + * again without ALLOC_CMA to see if to use CMA first. > + */ > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags) > +{ > + unsigned long watermark; > + bool cma_first = false; > + > + watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); > + /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */ > + if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) { > + /* > + * Balance movable allocations between regular and CMA areas by > + * allocating from CMA when over half of the zone's free memory > + * is in the CMA area. > + */ > + cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) > > + zone_page_state(zone, NR_FREE_PAGES) / 2); > + } else { > + /* > + * watermark failed means UNMOVABLE & RECLAIMBLE is not enough > + * now, we should use cma first to keep them stay around the > + * corresponding watermark > + */ > + cma_first = true; > + } > + return cma_first; I think it's a step in the right direction. However, it doesn't take the lowmem reserves into account. With DMA32 that can be an additional multiple gigabytes of "free" memory not available to GFP_KERNEL. It also has a knee in the balancing curve because it doesn't take reserves into account *until* non-CMA is depleted - at which point it would already be below the use-CMA threshold by the full reserves and watermarks. A more complete solution would have to plumb the highest_zoneidx information through the rmqueue family of functions somehow, and always take unavailable free memory into account: --- Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning to fail We can get into a situation where kernel allocations are starting to fail on watermarks, but movable allocations still don't use CMA because they make up more than half of the free memory. This can happen in particular with elevated vm.min_free_kbytes settings, where the remaining free pages aren't available to non-atomic requests. Example scenario: Free: 3.0G Watermarks: 2.0G CMA: 1.4G -> non-CMA: 1.6G CMA isn't used because CMA <= free/2. Kernel allocations fail due to non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon without swap), the kernel is more likely to OOM prematurely. Reduce the probability of that happening by taking reserves and watermarks into account when deciding whether to start using CMA. Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> --- mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------ 1 file changed, 59 insertions(+), 34 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 733732e7e0ba..b9273d7f23b8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, } +static bool should_try_cma(struct zone *zone, unsigned int order, + gfp_t gfp_flags, unsigned int alloc_flags) +{ + long free_pages; + + if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA)) + return false; + + /* + * CMA regions can be used by movable allocations while + * they're not otherwise in use. This is a delicate balance: + * Filling CMA too soon poses a latency risk for actual CMA + * allocations (think camera app startup). Filling CMA too + * late risks premature OOMs from non-movable allocations. + * + * Start using CMA once it dominates the remaining free + * memory. Be sure to take watermarks and reserves into + * account when considering what's truly "free". + * + * free_pages can go negative, but that's okay because + * NR_FREE_CMA_PAGES should not. + */ + + free_pages = zone_page_state(zone, NR_FREE_PAGES); + free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)]; + free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); + + return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2; +} + /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ static __always_inline struct page * -__rmqueue(struct zone *zone, unsigned int order, int migratetype, - unsigned int alloc_flags) +__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags, + int migratetype, unsigned int alloc_flags) { struct page *page; - if (IS_ENABLED(CONFIG_CMA)) { - /* - * Balance movable allocations between regular and CMA areas by - * allocating from CMA when over half of the zone's free memory - * is in the CMA area. - */ - if (alloc_flags & ALLOC_CMA && - zone_page_state(zone, NR_FREE_CMA_PAGES) > - zone_page_state(zone, NR_FREE_PAGES) / 2) { - page = __rmqueue_cma_fallback(zone, order); - if (page) - return page; - } + if (should_try_cma(zone, order, gfp_flags, alloc_flags)) { + page = __rmqueue_cma_fallback(zone, order); + if (page) + return page; } + retry: page = __rmqueue_smallest(zone, order, migratetype); if (unlikely(!page)) { @@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, * a single hold of the lock, for efficiency. Add them to the supplied list. * Returns the number of new pages which were placed at *list. */ -static int rmqueue_bulk(struct zone *zone, unsigned int order, +static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags, unsigned long count, struct list_head *list, int migratetype, unsigned int alloc_flags) { @@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, spin_lock_irqsave(&zone->lock, flags); for (i = 0; i < count; ++i) { - struct page *page = __rmqueue(zone, order, migratetype, - alloc_flags); + struct page *page = __rmqueue(zone, order, gfp_flags, + migratetype, alloc_flags); if (unlikely(page == NULL)) break; @@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, static __always_inline struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, - unsigned int order, unsigned int alloc_flags, - int migratetype) + unsigned int order, gfp_t gfp_flags, + unsigned int alloc_flags, int migratetype) { struct page *page; unsigned long flags; @@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, if (alloc_flags & ALLOC_HIGHATOMIC) page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (!page) { - page = __rmqueue(zone, order, migratetype, alloc_flags); + page = __rmqueue(zone, order, migratetype, + gfp_flags, alloc_flags); /* * If the allocation fails, allow OOM handling access @@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) /* Remove page from the per-cpu list, caller must protect the list */ static inline struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, - int migratetype, - unsigned int alloc_flags, - struct per_cpu_pages *pcp, - struct list_head *list) + gfp_t gfp_flags, int migratetype, + unsigned int alloc_flags, + struct per_cpu_pages *pcp, + struct list_head *list) { struct page *page; @@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, int batch = nr_pcp_alloc(pcp, zone, order); int alloced; - alloced = rmqueue_bulk(zone, order, + alloced = rmqueue_bulk(zone, order, gfp_flags, batch, list, migratetype, alloc_flags); @@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, /* Lock and remove page from the per-cpu list */ static struct page *rmqueue_pcplist(struct zone *preferred_zone, - struct zone *zone, unsigned int order, - int migratetype, unsigned int alloc_flags) + struct zone *zone, unsigned int order, + gfp_t gfp_flags, int migratetype, + unsigned int alloc_flags) { struct per_cpu_pages *pcp; struct list_head *list; @@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, */ pcp->free_count >>= 1; list = &pcp->lists[order_to_pindex(migratetype, order)]; - page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list); + page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype, + alloc_flags, pcp, list); pcp_spin_unlock(pcp); pcp_trylock_finish(UP_flags); if (page) { @@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone, if (likely(pcp_allowed_order(order))) { page = rmqueue_pcplist(preferred_zone, zone, order, - migratetype, alloc_flags); + gfp_flags, migratetype, alloc_flags); if (likely(page)) goto out; } - page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags, - migratetype); + page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags, + alloc_flags, migratetype); out: /* Separate test+clear to avoid unnecessary atomics */ @@ -4480,8 +4505,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, continue; } - page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags, - pcp, pcp_list); + page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype, + alloc_flags, pcp, pcp_list); if (unlikely(!page)) { /* Try and allocate at least one page */ if (!nr_account) { -- 2.42.0