Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

Gregory Price <gregory.price@xxxxxxxxxxxx> · Tue, 2 Jan 2024 15:30:36 -0500

On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
> 
> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry.memverge@xxxxxxxxx> writes:
> >> 
> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> >> > +{
> >> > +	nodemask_t nodemask = pol->nodes;
> >> > +	unsigned int target, weight_total = 0;
> >> > +	int nid;
> >> > +	unsigned char weights[MAX_NUMNODES];
> >> 
> >> MAX_NUMNODSE could be as large as 1024.  1KB stack space may be too
> >> large?
> >> 
> >
> > I've been struggling with a good solution to this.  We need a local copy
> > of weights to prevent weights from changing out from under us during
> > allocation (which may take quite some time), but it seemed unwise to
> > to allocate 1KB heap in this particular path.
> >
> > Is my concern unfounded?  If so, I can go ahead and add the allocation
> > code.
> 
> Please take a look at NODEMASK_ALLOC().
>

This is not my question. NODEMASK_ALLOC calls kmalloc/kfree. 

Some of the allocations on the stack can be replaced with a scratch
allocation, that's no big deal.

I'm specifically concerned about:
	weighted_interleave_nid
	alloc_pages_bulk_array_weighted_interleave

I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
contexts. If kmalloc/kfree is safe fine, this problem is trivial.

If not, there is no good solution to this without pre-allocating a
scratch area per-task.

> >> I don't think barrier() is needed to wait for memory operations for
> >> stack.  It's usually used for cross-processor memory order.
> >>
> >
> > This is present in the old interleave code.  To the best of my
> > understanding, the concern is for mempolicy->nodemask rebinding that can
> > occur when cgroups.cpusets.mems_allowed changes.
> >
> > so we can't iterate over (mempolicy->nodemask), we have to take a local
> > copy.
> >
> > My *best* understanding of the barrier here is to prevent the compiler
> > from reordering operations such that it attempts to optimize out the
> > local copy (or do lazy-fetch).
> >
> > It is present in the original interleave code, so I pulled it forward to
> > this, but I have not tested whether this is a bit paranoid or not.
> >
> > from `interleave_nid`:
> >
> >  /*
> >   * The barrier will stabilize the nodemask in a register or on
> >   * the stack so that it will stop changing under the code.
> >   *
> >   * Between first_node() and next_node(), pol->nodes could be changed
> >   * by other threads. So we put pol->nodes in a local stack.
> >   */
> >  barrier();
> 
> Got it.  This is kind of READ_ONCE() for nodemask.  To avoid to add
> comments all over the place.  Can we implement a wrapper for it?  For
> example, memcpy_once().  __read_once_size() in
> tools/include/linux/compiler.h can be used as reference.
> 
> Because node_weights[] may be changed simultaneously too.  We may need
> to consider similar issue for it too.  But RCU seems more appropriate
> for node_weights[].
> 

Weights are collected individually onto the stack because we have to sum
them up before we actually apply the weights.

A stale weight is not offensive.  RCU is not needed and doesn't help.

The reason the barrier is needed is not weights, it's the nodemask.

So you basically just want to replace barrier() with this and drop the
copy/pasted comments:

static void read_once_policy_nodemask(struct mempolicy *pol, nodemask_t *mask)
{
        /*
         * The barrier will stabilize the nodemask in a register or on
         * the stack so that it will stop changing under the code.
         *
         * Between first_node() and next_node(), pol->nodes could be changed
         * by other threads. So we put pol->nodes in a local stack.
         */
        barrier();
        __builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
        barrier();
}

- nodemask_t nodemask = pol->nodemask
- barrier()
+ nodemask_t nodemask;
+ read_once_policy_nodemask(pol, &nodemask)

Is that right?

~Gregory