On Thu, Oct 24, 2024 at 2:16 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 10/24/24 06:35, Yu Zhao wrote:
> > On Wed, Oct 23, 2024 at 1:35 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> >>
> >> On 10/23/24 08:36, Yu Zhao wrote:
> >> > On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> >> >>
> >> >> +Cc Mel and Matt
> >> >>
> >> >> On 10/21/24 19:25, Michal Hocko wrote:
> >> >>
> >> >> Hm I don't think it's completely WAI. The intention is that we
> >> >> should be able to unreserve the highatomic pageblocks before going
> >> >> OOM, and there seems to be an unintended corner case: if the
> >> >> pageblocks are fully exhausted, they are not reachable for
> >> >> unreserving.
> >> >
> >> > I still think unreserving should only apply to highatomic PBs that
> >> > contain free pages. Otherwise, it seems to me that it'd be
> >> > self-defeating because:
> >> > 1. Unreserving fully used highatomic PBs can't fulfill the alloc
> >> > demand immediately.
> >>
> >> I thought the alloc demand is only blocked on the pessimistic
> >> watermark calculation. Usable free pages exist, but the allocation
> >> is not allowed to use them.
> >
> > I think we are talking about two different problems here:
> > 1. The estimation problem.
> > 2. The unreserving policy problem.
> >
> > What you said here is correct w.r.t. the first problem, and I was
> > talking about the second problem.
>
> OK, but the problem with unreserving currently makes the problem of
> estimation worse and unfixable.
>
> >> > 2. More importantly, it only takes one alloc failure in
> >> > __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to
> >> > 2MB, from as high as 1% of a zone (in this case 1GB). IOW, it makes
> >> > more sense to me that highatomic only unreserves what it doesn't
> >> > fully use each time unreserve_highatomic_pageblock() is called, not
> >> > everything it got (except the last PB).
> >>
> >> But if the highatomic pageblocks are already full, we are not really
> >> removing any actual highatomic reserves just by changing the
> >> migratetype and decreasing nr_reserved_highatomic?
> >
> > If we change the MT, they can be fragmented a lot faster, i.e., from
> > the next near-OOM condition to upon becoming free. Trying to persist
> > over time is what actually makes those PBs more fragmentation
> > resistant.
>
> If we assume the allocations there have similar sizes and lifetimes,
> then I guess yeah.
>
> >> In fact that would allow the reserves to grow with some actual free
> >> pages in the future.
> >
> > Good point. I think I can explain it better along this line.
> >
> > If highatomic is under the limit, both your proposal and the current
> > implementation would try to grow, making not much difference.
> > However, the current implementation can also reuse previously full
> > PBs when they become available. So there is a clear winner here: the
> > current implementation.
>
> I'd say it depends on the user of the highatomic blocks (the
> workload), which way ends up better.
>
> > If highatomic has reached the limit, with your proposal, the growth
> > can only happen after unreserve, and unreserve only happens under
> > memory pressure. This means it's likely that it tries to grow under
> > memory pressure, which is more difficult than the condition where
> > there is plenty of memory. The current implementation doesn't try to
> > grow; rather, it keeps what it already has, betting on those full
> > PBs becoming available for reuse.
> > So I don't see a clear winner between trying to grow under memory
> > pressure and betting on becoming available for reuse.
>
> Understood. But also note there are many conditions where the current
> implementation and my proposal behave the same. If highatomic
> pageblocks become full and then only one or a few pages from each are
> freed, it suddenly becomes possible to unreserve them due to memory
> pressure, and there is no reuse for those highatomic allocations
> anymore. This very different outcome depends only on whether a single
> page is free for the unreserve to work, yet in terms of the
> pageblock-reuse efficiency you describe above, a single page is only a
> minor difference. My proposal would at least remove the sudden change
> of behavior when going from a single free page to no free page.
>
> >> Hm that assumes we're adding some checks in the free fastpath, and
> >> for that to work, also that there will be a freed page in a
> >> highatomic PB in the near enough future from the decision that we
> >> need to unreserve something. Which is not so different from the
> >> current assumption that we'll find such a free page already in the
> >> free list immediately.
> >>
> >> > To summarize, I think this is an estimation problem, which I would
> >> > categorize as a lesser problem than accounting problems. But it
> >> > sounds to me that you think it's a policy problem, i.e., the
> >> > highatomic unreserving policy is wrong or not properly implemented?
> >>
> >> Yeah, I'd say not properly implemented, but that sounds like a
> >> mechanism, not policy, problem to me :)
> >
> > What about adding a new counter to keep track of the size of free
> > pages reserved for highatomic?
>
> That's doable but not so trivial, and it means starting to handle the
> highatomic pageblocks much more carefully, like we do with CMA
> pageblocks and the NR_FREE_CMA_PAGES counter; otherwise we risk
> drifting the counter unrecoverably.
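Right, it would need the same care as NR_FREE_CMA_PAGES. For reference,
the pessimistic estimate discussed above is in the watermark check,
which currently treats the entire highatomic reserve as unusable
whether or not those pages are actually free. Roughly (condensed from
current mm/page_alloc.c, unrelated parts elided):

static inline long __zone_watermark_unusable_free(struct zone *z,
				unsigned int order, unsigned int alloc_flags)
{
	long unusable_free = (1 << order) - 1;

	/*
	 * If the caller does not have rights to reserves below the min
	 * watermark then subtract the high-atomic reserves. This will
	 * over-estimate the size of the atomic reserve but it avoids a
	 * search.
	 */
	if (likely(!(alloc_flags & ALLOC_RESERVES)))
		unusable_free += z->nr_reserved_highatomic;

#ifdef CONFIG_CMA
	/* If allocation can't use CMA areas don't use free CMA pages */
	if (!(alloc_flags & ALLOC_CMA))
		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

	return unusable_free;
}

When the highatomic PBs are fully used, there are no free highatomic
pages to offset that subtraction, so an allocation can fail the
watermark check even though plenty of usable free pages exist.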
The counter would be protected by the zone lock:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..86c63d48c08e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -824,6 +824,7 @@ struct zone {
 	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
+	unsigned long nr_free_highatomic;
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afab64814dc..4d8031817c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -644,6 +644,17 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
 		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
 }
 
+static void account_highatomic_freepages(struct zone *zone, unsigned int order, int old_mt, int new_mt)
+{
+	int nr_pages = 1 << order;
+
+	if (is_migrate_highatomic(old_mt))
+		zone->nr_free_highatomic -= nr_pages;
+
+	if (is_migrate_highatomic(new_mt))
+		zone->nr_free_highatomic += nr_pages;
+}
+
 /* Used for pages not on another list */
 static inline void __add_to_free_list(struct page *page, struct zone *zone,
 				      unsigned int order, int migratetype,
@@ -660,6 +671,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	account_highatomic_freepages(zone, order, -1, migratetype);
 }
 
 /*
@@ -681,6 +694,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 
 	account_freepages(zone, -(1 << order), old_mt);
 	account_freepages(zone, 1 << order, new_mt);
+
+	account_highatomic_freepages(zone, order, old_mt, new_mt);
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
@@ -698,6 +713,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	account_highatomic_freepages(zone, order, migratetype, -1);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -3085,7 +3102,7 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	 * over-estimate the size of the atomic reserve but it avoids a search.
 	 */
	if (likely(!(alloc_flags & ALLOC_RESERVES)))
-		unusable_free += z->nr_reserved_highatomic;
+		unusable_free += z->nr_free_highatomic;
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
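As for keeping the counter from drifting: all the helpers touched above
already run with the zone lock held, so every update is serialized. If
more safety is wanted, a CONFIG_DEBUG_VM-only cross-check against the
free lists could catch drift early. An untested sketch (the helper name
is made up, and it assumes the caller holds the zone lock):

#ifdef CONFIG_DEBUG_VM
/*
 * Hypothetical debug helper, not part of the patch above: recompute the
 * number of free highatomic pages by walking the free lists and compare
 * the result with the cached counter. Caller must hold the zone lock.
 */
static void check_nr_free_highatomic(struct zone *zone)
{
	unsigned long actual = 0;
	unsigned int order;
	struct page *page;

	for (order = 0; order < NR_PAGE_ORDERS; order++)
		list_for_each_entry(page,
				    &zone->free_area[order].free_list[MIGRATE_HIGHATOMIC],
				    buddy_list)
			actual += 1UL << order;

	VM_WARN_ON_ONCE(actual != zone->nr_free_highatomic);
}
#endif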