On 10/26/24 05:36, Yu Zhao wrote:
> OOM kills due to vastly overestimated free highatomic reserves were
> observed:
>
>   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
>   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
>   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
>
> The second line above shows that the OOM kill was due to the following
> condition:
>
>   free (1482936kB) - reserved_highatomic (1073152kB) = 409784kB < min (410416kB)
>
> And the third line shows there were no free pages in any
> MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> 'H'. Therefore __zone_watermark_unusable_free() underestimated the
> usable free memory by over 1GB, which resulted in the unnecessary OOM
> kill above.
>
> The comment in __zone_watermark_unusable_free() warns about the
> potential risk, i.e.:
>
>   If the caller does not have rights to reserves below the min
>   watermark then subtract the high-atomic reserves. This will
>   over-estimate the size of the atomic reserve but it avoids a search.
>
> However, it is possible to keep track of free pages in reserved
> highatomic pageblocks with a new per-zone counter nr_free_highatomic
> protected by the zone lock, to avoid a search when calculating the

It's only possible to track this reliably since the "mm: page_alloc:
freelist migratetype hygiene" patchset was merged, which explains why
nr_reserved_highatomic was used until now, even if it's imprecise. But I
agree it should be a good solution to the problem now.

> usable free memory. And the cost would be minimal, i.e., simple
> arithmetic in the highatomic alloc/free/move paths.
>
> Reported-by: Link Lin <linkl@xxxxxxxxxx>
> Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>

However, implementation-wise, that patchset also centralized everything
into account_freepages(), which currently handles NR_FREE_PAGES and
NR_FREE_CMA_PAGES, and your patch differs from that approach. Can't you
simply add a NR_FREE_HIGHATOMIC update there instead of creating
account_highatomic_freepages(), which is hooked into 3 different places?
And for consistency, indeed add a zone_stat_item counter instead of a
struct zone field? Then it becomes visible in e.g. /proc/zoneinfo and
/proc/vmstat too. Bonus points for printing it in the oom/alloc failure
reports as well (wherever we print NR_FREE_CMA_PAGES).
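Roughly like this, I think - an untested sketch only, assuming a new
NR_FREE_HIGHATOMIC entry in enum zone_stat_item next to NR_FREE_CMA_PAGES,
and with the surrounding account_freepages() lines written from memory, so
they may not match your tree exactly:

static inline void account_freepages(struct zone *zone, int nr_pages,
				     int migratetype)
{
	lockdep_assert_held(&zone->lock);

	if (is_migrate_isolate(migratetype))
		return;

	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

	if (is_migrate_cma(migratetype))
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
	else if (is_migrate_highatomic(migratetype))
		/* new: covers all the freelist add/move/del paths */
		__mod_zone_page_state(zone, NR_FREE_HIGHATOMIC, nr_pages);
}

That way the freelist helpers need no extra hooks and the counter stays
consistent with NR_FREE_PAGES under the zone lock.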
Thanks!
Vlastimil

> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2e8c4307c728..5e8f567753bd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -825,6 +825,7 @@ struct zone {
>  	unsigned long watermark_boost;
>  
>  	unsigned long nr_reserved_highatomic;
> +	unsigned long nr_free_highatomic;
>  
>  	/*
>  	 * We don't know if the memory that we're going to allocate will be
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c81762c49b00..43ecc7d75a2a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -644,6 +644,18 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
>  		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
>  }
>  
> +static void account_highatomic_freepages(struct zone *zone, unsigned int order,
> +					 int old_mt, int new_mt)
> +{
> +	int delta = 1 << order;
> +
> +	if (is_migrate_highatomic(old_mt))
> +		WRITE_ONCE(zone->nr_free_highatomic, zone->nr_free_highatomic - delta);
> +
> +	if (is_migrate_highatomic(new_mt))
> +		WRITE_ONCE(zone->nr_free_highatomic, zone->nr_free_highatomic + delta);
> +}
> +
>  /* Used for pages not on another list */
>  static inline void __add_to_free_list(struct page *page, struct zone *zone,
>  				      unsigned int order, int migratetype,
> @@ -660,6 +672,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
>  	else
>  		list_add(&page->buddy_list, &area->free_list[migratetype]);
>  	area->nr_free++;
> +
> +	account_highatomic_freepages(zone, order, -1, migratetype);
>  }
>  
>  /*
> @@ -681,6 +695,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>  
>  	account_freepages(zone, -(1 << order), old_mt);
>  	account_freepages(zone, 1 << order, new_mt);
> +
> +	account_highatomic_freepages(zone, order, old_mt, new_mt);
>  }
>  
>  static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
> @@ -698,6 +714,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
>  	__ClearPageBuddy(page);
>  	set_page_private(page, 0);
>  	zone->free_area[order].nr_free--;
> +
> +	account_highatomic_freepages(zone, order, migratetype, -1);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> @@ -3119,11 +3137,10 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
>  
>  	/*
>  	 * If the caller does not have rights to reserves below the min
> -	 * watermark then subtract the high-atomic reserves. This will
> -	 * over-estimate the size of the atomic reserve but it avoids a search.
> +	 * watermark then subtract the free pages reserved for highatomic.
>  	 */
>  	if (likely(!(alloc_flags & ALLOC_RESERVES)))
> -		unusable_free += z->nr_reserved_highatomic;
> +		unusable_free += READ_ONCE(z->nr_free_highatomic);
>  
>  #ifdef CONFIG_CMA
>  	/* If allocation can't use CMA areas don't use free CMA pages */
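
And with a zone_stat_item this last hunk wouldn't need a READ_ONCE() on a
struct zone field at all; the read side could simply be (again an untested
sketch):

	if (likely(!(alloc_flags & ALLOC_RESERVES)))
		unusable_free += zone_page_state(z, NR_FREE_HIGHATOMIC);

which would mirror how the CONFIG_CMA block just below it reads
NR_FREE_CMA_PAGES via zone_page_state(), IIRC.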