On Thu, Jan 04, 2024 at 01:39:31PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>
> > On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
> >> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
> >> > I'm specifically concerned about:
> >> >   weighted_interleave_nid
> >> >   alloc_pages_bulk_array_weighted_interleave
> >> >
> >> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> >> > contexts. If kmalloc/kfree is safe, fine, this problem is trivial.
> >> >
> >> > If not, there is no good solution to this without pre-allocating a
> >> > scratch area per-task.
> >>
> >> You need to audit whether it's safe for all callers. I guess that you
> >> need to allocate pages after calling, so you can use the same GFP flags
> >> here.
> >
> > After picking away at it, I realized that this code is usually going to
> > get called during page fault handling - duh. So kmalloc is almost never
> > safe (or can fail), and it's nasty to try to handle those errors.
>
> Why not just OOM for allocation failure?

Two notes:

1) Callers of weighted_interleave_nid do not expect OOM conditions; they
   expect a node selection. On error, we would simply return the local
   NUMA node without any indication of failure.

2) Callers of alloc_pages_bulk_array_weighted_interleave receive the
   total number of pages allocated. They are expected to detect pages
   allocated != pages requested, and then decide whether to OOM or
   simply retry (allocation may fail for a variety of reasons).

By introducing an allocation into this path, an allocation failure would
essentially have to be silently squashed: we would return either the
local node (weighted_interleave_nid) or 0 (bulk allocator) and let the
allocation logic handle any subsequent OOM condition.

That felt less desirable than just allocating a scratch space up front
in the mempolicy and avoiding the issue altogether.
> > Instead of doing that, I simply chose to implement the scratch space
> > in the mempolicy structure:
> >
> >     mempolicy->wil.scratch_weights[MAX_NUMNODES]
> >
> > We eat an extra 1KB of memory in the mempolicy, but it gives us a safe
> > scratch space we can use any time the task is allocating memory, and
> > prevents the need for any fancy error handling. That seems like a
> > perfectly reasonable tradeoff.
>
> I don't think that this is a good idea. The weight array is temporary.

It's temporary, but it's also only used in the context of the task while
the alloc lock is held.

If you think it's fine to introduce another potential OOM-generating
spot, then I'll just go ahead and allocate the memory on the fly.

I do want to point out, though, that weighted_interleave_nid is called
per allocated page. So now we're not just collecting weights to
calculate the offset, we're doing an allocation (that can fail) for each
page allocated in that region. The bulk allocator amortizes the cost of
this allocation by doing it once while allocating a chunk of pages - but
weighted_interleave_nid is called per-page.

By comparison, the memory cost of a static scratch area in the mempolicy
struct is only incurred by tasks that actually have a mempolicy. So
we're talking ~1MB across 1024 threads with mempolicies to avoid error
conditions mid-page-allocation and to reduce the cost of applying
weighted interleave.

~Gregory