Re: [PATCH] mm: page_alloc: consume available CMA space first

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 27 Jul 2023 11:34:13 -0400

On Wed, Jul 26, 2023 at 04:38:11PM -0700, Roman Gushchin wrote:
> On Wed, Jul 26, 2023 at 10:53:04AM -0400, Johannes Weiner wrote:
> > On a memcache setup with heavy anon usage and no swap, we routinely
> > see premature OOM kills with multiple gigabytes of free space left:
> > 
> >     Node 0 Normal free:4978632kB [...] free_cma:4893276kB
> > 
> > This free space turns out to be CMA. We set CMA regions aside for
> > potential hugetlb users on all of our machines, figuring that even if
> > there aren't any, the memory is available to userspace allocations.
> > 
> > When the OOMs trigger, it's from unmovable and reclaimable allocations
> > that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
> > dominated by the anon pages.
> > 
> > Because we have more options for CMA pages, change the policy to
> > always fill up CMA first. This reduces the risk of premature OOMs.
> 
> I suspect it might cause regressions on small(er) devices where
> a relatively small CMA area (MBs) is often reserved for use by various
> device drivers, which can't handle allocation failures well (even interim
> allocation failures). Startup time can regress too: migrating pages out of
> CMA will take time.

The page allocator is currently happy to give away all CMA memory to
movables before entering reclaim. It will use CMA even before falling
back to a different migratetype.

Do these small setups take special precautions to never fill memory?
Proactively trim file cache? Never swap? Because AFAICS, unless they
do so, this would only change the timing of when CMA fills up, not if.
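
To make the timing argument concrete, here is a toy userspace model of
the two policies. It is a sketch, not kernel code: struct zone_model,
the policy enum and the helper names are all hypothetical, and
POLICY_BALANCED assumes the current heuristic is roughly "use CMA for
movable allocations once it holds more than half of the zone's free
pages".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one zone: free page counters only, no buddy lists. */
struct zone_model {
	long free_cma;		/* free pages inside CMA regions */
	long free_other;	/* free pages outside CMA */
};

enum policy {
	POLICY_BALANCED,	/* assumed current rule: CMA if > half of free */
	POLICY_CMA_FIRST,	/* proposed rule: drain CMA before anything else */
};

/*
 * Allocate one movable page.
 * Returns 1 if it came from CMA, 0 if from non-CMA, -1 if nothing is free.
 */
static int alloc_movable(struct zone_model *z, enum policy p)
{
	long total = z->free_cma + z->free_other;
	bool want_cma;

	assert(z->free_cma >= 0 && z->free_other >= 0);
	if (total == 0)
		return -1;

	if (p == POLICY_CMA_FIRST)
		want_cma = z->free_cma > 0;
	else
		want_cma = z->free_cma > total / 2;

	/* Movable requests may also fall back to CMA when non-CMA is empty. */
	if (z->free_cma > 0 && (want_cma || z->free_other == 0)) {
		z->free_cma--;
		return 1;
	}
	z->free_other--;
	return 0;
}

/* Unmovable requests can never dip into CMA. */
static bool alloc_unmovable(struct zone_model *z)
{
	if (z->free_other == 0)
		return false;
	z->free_other--;
	return true;
}

/* Allocate up to nr movable pages; returns how many succeeded. */
static long fill_movable(struct zone_model *z, enum policy p, long nr)
{
	long done = 0;

	while (done < nr && alloc_movable(z, p) >= 0)
		done++;
	return done;
}
```

Starting from 1000 free CMA pages and 1000 free non-CMA pages, 1000
movable allocations leave 500 non-CMA pages for unmovable requests
under POLICY_BALANCED and all 1000 under POLICY_CMA_FIRST. After 2000
movable allocations CMA is exhausted under either policy, which is the
point above: the policy changes when CMA fills up, not whether it does.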

> And given the velocity of kernel upgrades on such devices, we won't learn about
> it for the next couple of years.

That's true. However, a potential regression with this would show up
fairly early in kernel validation since CMA would fill up on a more
predictable timeline. And the change is easy to revert, too.

Given that we have a concrete problem with the current behavior, I
think it's fair to ask for stronger evidence that this will indeed
cause a regression elsewhere before raising the bar on the fix.

> > Movable pages can be migrated out of CMA when necessary, but we don't
> > have a mechanism to migrate them *into* CMA to make room for unmovable
> > allocations. The only recourse we have for these pages is reclaim,
> > which due to a lack of swap is unavailable in our case.
> 
> Idk, should we introduce such a mechanism? Or use some alternative heuristics,
> which would be a better compromise between those who need CMA allocations to
> always pass and those who use large CMA areas for opportunistic huge page
> allocations? Of course, we can add a boot flag/sysctl/per-cma-area flag, but I
> doubt we really want this.

Right, having migration into CMA could be a viable option as well.

But I would like to learn more from CMA users and their expectations,
since there isn't currently a guarantee that CMA stays empty.

This patch would definitely be the simpler solution. It would also
shave some branches and cycles off the buddy hotpath for many users
that don't actively use CMA but have CONFIG_CMA=y (I checked Arch Linux
and Fedora, not sure about SUSE).