Re: [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave

Gregory Price <gregory.price@xxxxxxxxxxxx> · Tue, 26 Dec 2023 02:45:18 -0500

On Mon, Dec 25, 2023 at 03:54:18PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@xxxxxxxxx> writes:
> 
> > For example, the stream benchmark demonstrates that default interleave
> > is actively harmful, where weighted interleave is beneficial.
> >
> > Hardware: 1-socket 8 channel DDR5 + 1 CXL expander in PCIe x16
> > Default interleave : -78% (slower than DRAM)
> > Global weighting   : -6% to +4% (workload dependent)
> > Targeted weights   : +2.5% to +4% (consistently better than DRAM)
> >
> > If nothing else, this shows how awful round-robin interleave is.
> 
> I guess the performance of the default policy, local (fast memory)
> first, may be even better in some situation?  For example, before the
> bandwidth of DRAM is saturated?
> 

Yes - but it's more complicated than that.

Global weighting here means we did `numactl -w --interleave ...`, which
means *all* memory regions will be interleaved.  Code, stack, heap, etc.

Targeted weights means we used mbind2() with local weights, which only
targeted specific heap regions.

The default policy was likely better than global weighting because
distributing things like stack and code to higher-latency memory
produced a measurable overhead.

To provide this, we only applied weights to bandwidth driving regions,
and as a result we demonstrated a measurable performance increase.

So yes, the default policy may be better in some situations - but that
will be true of any policy.
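
To make the round-robin vs. weighted difference concrete, here is a small
userspace model of weighted node selection.  This is only a sketch, not
the code from the patch set: `struct wil_state`, `wil_init()` and
`wil_next_node()` are hypothetical names, and treating a zero weight as 1
is an assumption made for the sketch.

```c
#include <stddef.h>

/*
 * Hypothetical userspace model of weighted interleave.
 * Each node receives weights[node] consecutive allocations before the
 * cursor moves on.  With all weights equal to 1 this degenerates to
 * plain round-robin interleave.
 */
struct wil_state {
    const unsigned char *weights;   /* one weight per node */
    size_t nr_nodes;                /* number of nodes in the set */
    size_t cur_node;                /* node currently being filled */
    unsigned char remaining;        /* allocations left on cur_node */
};

/* Assumption for this sketch: a weight of 0 is treated as 1. */
static unsigned char wil_weight(const struct wil_state *s, size_t node)
{
    return s->weights[node] ? s->weights[node] : 1;
}

static void wil_init(struct wil_state *s, const unsigned char *weights,
                     size_t nr_nodes)
{
    s->weights = weights;
    s->nr_nodes = nr_nodes;
    s->cur_node = 0;
    s->remaining = wil_weight(s, 0);
}

/* Return the node that the next allocation should land on. */
static size_t wil_next_node(struct wil_state *s)
{
    size_t node;

    if (s->remaining == 0) {
        /* Current node has used up its share: advance and reload. */
        s->cur_node = (s->cur_node + 1) % s->nr_nodes;
        s->remaining = wil_weight(s, s->cur_node);
    }
    node = s->cur_node;
    s->remaining--;
    return node;
}
```

With weights of {4, 1} for a DRAM node and a CXL node, four out of every
five allocations in this model land on the DRAM node, where plain
round-robin would send half of them to CXL.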

> I understand that you may want to limit the memory usage of the fast
> memory too.  But IMHO, that is another requirements.  That should be
> enforced by something like per-node memory limit.
> 

This interface does not limit memory usage of a particular node; it
distributes data according to the requested policy.

Nuanced distinction, but important.  If nodes become exhausted, tasks
are still free to allocate memory from any node in the nodemask, even if
it violates the requested mempolicy.

This is consistent with the existing behavior of mempolicy.

> > =====================================================================
> > (Patches 3-6) Refactoring mempolicy for code-reuse
> >
> > To avoid multiple paths of mempolicy creation, we should refactor the
> > existing code to enable the designed extensibility, and refactor
> > existing users to utilize the new interface (while retaining the
> > existing userland interface).
> >
> > This set of patches introduces a new mempolicy_args structure, which
> > is used to more fully describe a requested mempolicy - to include
> > existing and future extensions.
> >
> > /*
> >  * Describes settings of a mempolicy during set/get syscalls and
> >  * kernel internal calls to do_set_mempolicy()
> >  */
> > struct mempolicy_args {
> >     unsigned short mode;            /* policy mode */
> >     unsigned short mode_flags;      /* policy mode flags */
> >     int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
> >     nodemask_t *policy_nodes;       /* get/set/mbind */
> >     unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
> > };
> 
> According to
> 
> https://www.geeksforgeeks.org/difference-between-argument-and-parameter-in-c-c-with-examples/
> 
> it appears that "parameter" are better than "argument" for struct name
> here.  It appears that current kernel source supports this too.
> 
> $ grep 'struct[\t ]\+[a-zA-Z0-9]\+_param' -r include/linux | wc -l
> 411
> $ grep 'struct[\t ]\+[a-zA-Z0-9]\+_arg' -r include/linux | wc -l
> 25
> 

Will change.

> > This arg structure will eventually be utilized by the following
> > interfaces:
> >     mpol_new() - new mempolicy creation
> >     do_get_mempolicy() - acquiring information about mempolicy
> >     do_set_mempolicy() - setting the task mempolicy
> >     do_mbind()         - setting a vma mempolicy
> >
> > do_get_mempolicy() is completely refactored to break it out into
> > separate functionality based on the flags provided by get_mempolicy(2)
> >     MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
> >     MPOL_F_ADDR: acquires information on vma policies
> >     MPOL_F_NODE: changes the output for the policy arg to node info
> >
> > We refactor the get_mempolicy syscall to flatten the logic based on these
> > flags, and allow for set_mempolicy2() to re-use the underlying logic.
> >
> > The result of this refactor, and the new mempolicy_args structure, is
> > that extensions like 'sys_set_mempolicy_home_node' can now be directly
> > integrated into the initial call to 'set_mempolicy2', and that more
> > complete information about a mempolicy can be returned with a single
> > call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
> >
> >
> > =====================================================================
> > (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
> >
> > These interfaces are the 'extended' counterpart to their relatives.
> > They use the userland 'struct mpol_args' structure to communicate a
> > complete mempolicy configuration to the kernel.  This structure
> > looks very much like the kernel-internal 'struct mempolicy_args':
> >
> > struct mpol_args {
> >         /* Basic mempolicy settings */
> >         __u16 mode;
> >         __u16 mode_flags;
> >         __s32 home_node;
> >         __u64 pol_maxnodes;
> 
> I understand that we want to avoid hole in struct.  But I still feel
> uncomfortable to use __u64 for a small.  But I don't have solution too.
> Anyone else has some idea?
>

maxnode has been an `unsigned long` in every other interface for quite
some time.  Seems better to keep this consistent than to have it be
`unsigned long` over here and `unsigned short` over there.
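
On the hole itself, a quick layout check suggests a narrower field would
not buy anything.  This is a sketch under assumptions: both structs below
are hypothetical userspace mirrors of only the first five quoted fields,
and `_Alignas(8) uint64_t` stands in for `__aligned_u64`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the quoted fields with a 64-bit maxnodes. */
struct mpol_args_wide {
    uint16_t mode;
    uint16_t mode_flags;
    int32_t  home_node;
    _Alignas(8) uint64_t pol_maxnodes;
    _Alignas(8) uint64_t pol_nodes;     /* stand-in for __aligned_u64 */
};

/* Hypothetical variant with a 16-bit maxnodes, for comparison only. */
struct mpol_args_narrow {
    uint16_t mode;
    uint16_t mode_flags;
    int32_t  home_node;
    uint16_t pol_maxnodes;
    _Alignas(8) uint64_t pol_nodes;     /* stand-in for __aligned_u64 */
};

/* Bytes of padding between pol_maxnodes and pol_nodes (narrow variant). */
static size_t narrow_padding_bytes(void)
{
    return offsetof(struct mpol_args_narrow, pol_nodes) -
           (offsetof(struct mpol_args_narrow, pol_maxnodes) +
            sizeof(uint16_t));
}

/* The wide layout packs with no padding before either 64-bit field. */
_Static_assert(offsetof(struct mpol_args_wide, pol_maxnodes) == 8,
               "wide: pol_maxnodes follows home_node directly");
_Static_assert(offsetof(struct mpol_args_wide, pol_nodes) == 16,
               "wide: pol_nodes follows pol_maxnodes directly");

/* Both variants end up the same size; the narrow one just adds a hole. */
_Static_assert(sizeof(struct mpol_args_wide) ==
               sizeof(struct mpol_args_narrow),
               "narrow maxnodes does not shrink the struct");
```

In this model the narrow variant leaves six bytes of padding before
pol_nodes, so the 64-bit field costs no space once the following member
is 8-byte aligned.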

> >         __aligned_u64 pol_nodes;
> >         __aligned_u64 *il_weights;      /* of size pol_maxnodes */
> 
> Typo?  Should be,
> 

derp derp

> >
> > The 'flags' argument for mbind2 is the same as 'mbind', except with
> > the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
> > field should be utilized.
> >
> > The 'flags' argument for get_mempolicy2 allows for MPOL_F_ADDR to
> > allow operating on VMA policies, but MPOL_F_NODE and MPOL_F_MEMS_ALLOWED
> > behavior has been omitted, since get_mempolicy() provides this already.
> 
> I still think that it's a good idea to make it possible to deprecate
> get_mempolicy().  How about use a union as follows?
> 
> struct mpol_mems_allowed {
>          __u64 maxnodes;
>          __aligned_u64 nodemask;
> };
> 
> union mpol_info {
>         struct mpol_args args;
>         struct mpol_mems_allowed mems_allowed;
>         __s32 node;
> };
> 

See my other email.  I've come around to see mems_allowed as a wart that
needs to be removed.  The same information is already available via
sysfs cpusets.mems and cpusets.mems_effective.

Additionally, mems_allowed isn't even technically part of the mempolicy,
so if we did want an interface to acquire the information, you'd prefer
to just implement a stand-alone syscall.

The sysfs interface seems sufficient though.
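
As a rough illustration of consuming that from userland, the sketch below
parses a node list in the usual list format (e.g. "0-1,3") into a bitmask.
Assumptions: the helper names are hypothetical, the mask is capped at 64
nodes for simplicity, and the path of the cpuset file is left to the
caller since it depends on where the cgroup hierarchy is mounted.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a node list such as "0-1,3" into a 64-bit mask.
 * Returns 0 on success, -1 on malformed input or a node id >= 64
 * (the 64-node cap is a simplification for this sketch).
 */
static int parse_nodelist(const char *s, uint64_t *mask)
{
    uint64_t out = 0;

    while (*s && *s != '\n') {
        char *end;
        unsigned long lo, hi, n;

        if (*s < '0' || *s > '9')
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;
        if (*end == '-') {              /* range "lo-hi" */
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (hi < lo || hi >= 64)
            return -1;
        for (n = lo; n <= hi; n++)
            out |= (uint64_t)1 << n;
        s = end;
        if (*s == ',')                  /* skip the separator */
            s++;
    }
    *mask = out;
    return 0;
}

/* Read the first line of a caller-supplied file and parse it. */
static int read_nodelist_file(const char *path, uint64_t *mask)
{
    char buf[256];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        ret = parse_nodelist(buf, mask);
    fclose(f);
    return ret;
}
```

Calling read_nodelist_file() on the task's cpuset file would then yield
the allowed nodes without involving get_mempolicy() at all.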

`policy_node` is a similar "why does this even exist" type feature,
except that it can still be used from get_mempolicy() and if there is an
actual reason to extend it to get_mempolicy2() it can be added to
mpol_params.

~Gregory