On Thu, Oct 24, 2024 at 2:16 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 10/24/24 06:35, Yu Zhao wrote:
> > On Wed, Oct 23, 2024 at 1:35 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> >>
> >> On 10/23/24 08:36, Yu Zhao wrote:
> >> > On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> >> >>
> >> >> +Cc Mel and Matt
> >> >>
> >> >> On 10/21/24 19:25, Michal Hocko wrote:
> >> >>
> >> >> Hm I don't think it's completely WAI. The intention is that we
> >> >> should be able to unreserve the highatomic pageblocks before going
> >> >> OOM, and there seems to be an unintended corner case: if the
> >> >> pageblocks are fully exhausted, they are not reachable for
> >> >> unreserving.
> >> >
> >> > I still think unreserving should only apply to highatomic PBs that
> >> > contain free pages. Otherwise, it seems to me that it'd be
> >> > self-defeating because:
> >> > 1. Unreserving fully used highatomic PBs can't fulfill the alloc
> >> > demand immediately.
> >>
> >> I thought the alloc demand is only blocked on the pessimistic
> >> watermark calculation. Usable free pages exist, but the allocation
> >> is not allowed to use them.
> >
> > I think we are talking about two different problems here:
> > 1. The estimation problem.
> > 2. The unreserving policy problem.
> >
> > What you said here is correct w.r.t. the first problem, and I was
> > talking about the second problem.
>
> OK, but the problem with unreserving currently makes the problem of
> estimation worse and unfixable.
>
> >> > 2. More importantly, it only takes one alloc failure in
> >> > __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to
> >> > 2MB, from as high as 1% of a zone (in this case 1GB). IOW, it makes
> >> > more sense to me that highatomic only unreserves what it doesn't
> >> > fully use each time unreserve_highatomic_pageblock() is called, not
> >> > everything it got (except the last PB).
> >>
> >> But if the highatomic pageblocks are already full, we are not really
> >> removing any actual highatomic reserves just by changing the
> >> migratetype and decreasing nr_reserved_highatomic?
> >
> > If we change the MT, they can be fragmented a lot faster, i.e., from
> > the next near-OOM condition to upon becoming free. Trying to persist
> > over time is what actually makes those PBs more fragmentation
> > resistant.
>
> If we assume the allocations there have similar sizes and lifetimes,
> then I guess yeah.
>
> >> In fact that would allow the reserves to grow with some actual free
> >> pages in the future.
> >
> > Good point. I think I can explain it better along this line.
> >
> > If highatomic is under the limit, both your proposal and the current
> > implementation would try to grow, making not much difference.
> > However, the current implementation can also reuse previously full
> > PBs when they become available. So there is a clear winner here: the
> > current implementation.
>
> I'd say it depends on the user of the highatomic blocks (the
> workload), which way ends up better.
>
> > If highatomic has reached the limit, with your proposal, the growth
> > can only happen after unreserve, and unreserve only happens under
> > memory pressure. This means it's likely that it tries to grow under
> > memory pressure, which is more difficult than the condition where
> > there is plenty of memory. The current implementation doesn't try to
> > grow; rather, it keeps what it already has, betting on those full
> > PBs becoming available for reuse.
> > So I don't see a clear winner between trying to grow under memory
> > pressure and betting on becoming available for reuse.
>
> Understood. But also note there are many conditions where the current
> implementation and my proposal behave the same. If highatomic
> pageblocks become full and then only one or a few pages from each are
> freed, it suddenly becomes possible to unreserve them due to memory
> pressure, and there is no reuse for those highatomic allocations
> anymore. This very different outcome depends only on whether a single
> page is free for the unreserve to work, yet in terms of the
> pageblock-reuse efficiency you describe above, a single page is only a
> minor difference. My proposal would at least remove the sudden change
> of behavior when going from a single free page to no free page.
>
> >> Hm that assumes we're adding some checks in the free fastpath, and
> >> for that to work, also that there will be a freed page in a
> >> highatomic PB in the near enough future from the decision that we
> >> need to unreserve something. Which is not so different from the
> >> current assumption that we'll find such a free page already in the
> >> free list immediately.
> >>
> >> > To summarize, I think this is an estimation problem, which I would
> >> > categorize as a lesser problem than accounting problems. But it
> >> > sounds to me that you think it's a policy problem, i.e., the
> >> > highatomic unreserving policy is wrong or not properly implemented?
> >>
> >> Yeah, I'd say not properly implemented, but that sounds like a
> >> mechanism, not policy, problem to me :)
> >
> > What about adding a new counter to keep track of the size of free
> > pages reserved for highatomic?
>
> That's doable but not so trivial, and it means starting to handle the
> highatomic pageblocks much more carefully, like we do with CMA
> pageblocks and the NR_FREE_CMA_PAGES counter; otherwise we risk
> drifting the counter unrecoverably.
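Right, it would need the same care as NR_FREE_CMA_PAGES. For reference,
the pessimistic estimate discussed above is in the watermark check,
which currently treats the entire highatomic reserve as unusable
whether or not those pages are actually free. Roughly (condensed from
current mm/page_alloc.c, unrelated parts elided):

static inline long __zone_watermark_unusable_free(struct zone *z,
				unsigned int order, unsigned int alloc_flags)
{
	long unusable_free = (1 << order) - 1;

	/*
	 * If the caller does not have rights to reserves below the min
	 * watermark then subtract the high-atomic reserves. This will
	 * over-estimate the size of the atomic reserve but it avoids a
	 * search.
	 */
	if (likely(!(alloc_flags & ALLOC_RESERVES)))
		unusable_free += z->nr_reserved_highatomic;

#ifdef CONFIG_CMA
	/* If allocation can't use CMA areas don't use free CMA pages */
	if (!(alloc_flags & ALLOC_CMA))
		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

	return unusable_free;
}

When the highatomic PBs are fully used, there are no free highatomic
pages to offset that subtraction, so an allocation can fail the
watermark check even though plenty of usable free pages exist.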
The counter would be protected by the zone lock:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..86c63d48c08e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -824,6 +824,7 @@ struct zone {
 	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
+	unsigned long nr_free_highatomic;
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afab64814dc..4d8031817c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -644,6 +644,17 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
 		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
 }
 
+static void account_highatomic_freepages(struct zone *zone, unsigned int order, int old_mt, int new_mt)
+{
+	int nr_pages = 1 << order;
+
+	if (is_migrate_highatomic(old_mt))
+		zone->nr_free_highatomic -= nr_pages;
+
+	if (is_migrate_highatomic(new_mt))
+		zone->nr_free_highatomic += nr_pages;
+}
+
 /* Used for pages not on another list */
 static inline void __add_to_free_list(struct page *page, struct zone *zone,
 				      unsigned int order, int migratetype,
@@ -660,6 +671,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	account_highatomic_freepages(zone, order, -1, migratetype);
 }
 
 /*
@@ -681,6 +694,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 
 	account_freepages(zone, -(1 << order), old_mt);
 	account_freepages(zone, 1 << order, new_mt);
+
+	account_highatomic_freepages(zone, order, old_mt, new_mt);
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
@@ -698,6 +713,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	account_highatomic_freepages(zone, order, migratetype, -1);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -3085,7 +3102,7 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	 * over-estimate the size of the atomic reserve but it avoids a search.
 	 */
	if (likely(!(alloc_flags & ALLOC_RESERVES)))
-		unusable_free += z->nr_reserved_highatomic;
+		unusable_free += z->nr_free_highatomic;
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
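As for keeping the counter from drifting: all the helpers touched above
already run with the zone lock held, so every update is serialized. If
more safety is wanted, a CONFIG_DEBUG_VM-only cross-check against the
free lists could catch drift early. An untested sketch (the helper name
is made up, and it assumes the caller holds the zone lock):

#ifdef CONFIG_DEBUG_VM
/*
 * Hypothetical debug helper, not part of the patch above: recompute the
 * number of free highatomic pages by walking the free lists and compare
 * the result with the cached counter. Caller must hold the zone lock.
 */
static void check_nr_free_highatomic(struct zone *zone)
{
	unsigned long actual = 0;
	unsigned int order;
	struct page *page;

	for (order = 0; order < NR_PAGE_ORDERS; order++)
		list_for_each_entry(page,
				    &zone->free_area[order].free_list[MIGRATE_HIGHATOMIC],
				    buddy_list)
			actual += 1UL << order;

	VM_WARN_ON_ONCE(actual != zone->nr_free_highatomic);
}
#endif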