Gregory Price <gourry.memverge@xxxxxxxxx> writes:

> Weighted interleave is a new interleave policy intended to make
> use of a the new distributed-memory environment made available
> by CXL. The existing interleave mechanism does an even round-robin
> distribution of memory across all nodes in a nodemask, while
> weighted interleave can distribute memory across nodes according
> the available bandwidth that that node provides.
>
> As tests below show, "default interleave" can cause major performance
> degredation due to distribution not matching bandwidth available,
> while "weighted interleave" can provide a performance increase.
>
> For example, the stream benchmark demonstrates that default interleave
> is actively harmful, where weighted interleave is beneficial.
>
> Hardware: 1-socket 8 channel DDR5 + 1 CXL expander in PCIe x16
> Default interleave : -78% (slower than DRAM)
> Global weighting   : -6% to +4% (workload dependant)
> Targeted weights   : +2.5% to +4% (consistently better than DRAM)
>
> If nothing else, this shows how awful round-robin interleave is.

I guess that the default policy, local (fast memory) first, may perform
even better in some situations?  For example, before the DRAM bandwidth
is saturated?

I understand that you may want to limit the memory usage of the fast
memory too.  But IMHO, that is a separate requirement.  It should be
enforced by something like a per-node memory limit.

> Rather than implement yet another specific syscall to set one
> particular field of a mempolicy, we chose to implement an extensible
> mempolicy interface so that future extensions can be captured.
>
> To implement weighted interleave, we need an interface to set the
> node weights along with a MPOL_WEIGHTED_INTERLEAVE. We implement a
> a sysfs extension for "system global" weights which can be set by
> a daemon or administrator, and new extensible syscalls (mempolicy2,
> mbind2) which allow task-local weights to be set via user-software.
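
For readers who want the difference between even and weighted
round-robin spelled out, below is a minimal user-space sketch.  It is
not taken from the patches: `struct il_cursor`, `il_next_node()` and
`il_distribute()` are hypothetical names, and the policy shown (give
node i `weights[i]` consecutive allocations per round) is only one
plausible way to honor per-node weights.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleave cursor; not a kernel data structure. */
struct il_cursor {
        size_t node;            /* node currently receiving allocations */
        unsigned char left;     /* allocations left on that node this round */
};

/*
 * Pick the node for the next allocation.  Node i receives weights[i]
 * consecutive allocations before the cursor moves on.  At least one
 * weight must be nonzero, otherwise no node can ever be chosen.
 */
static size_t il_next_node(struct il_cursor *cur,
                           const unsigned char *weights, size_t nr_nodes)
{
        size_t tries = 0;

        while (cur->left == 0) {
                assert(tries < nr_nodes);       /* reject all-zero weights */
                tries++;
                cur->node = (cur->node + 1) % nr_nodes;
                cur->left = weights[cur->node];
        }
        cur->left--;
        return cur->node;
}

/* Count how many of nr_pages allocations land on each node. */
static void il_distribute(const unsigned char *weights, size_t nr_nodes,
                          size_t nr_pages, size_t *count)
{
        /* Start "before" node 0 so the first allocation goes to node 0. */
        struct il_cursor cur = { .node = nr_nodes - 1, .left = 0 };
        size_t i;

        for (i = 0; i < nr_nodes; i++)
                count[i] = 0;
        for (i = 0; i < nr_pages; i++)
                count[il_next_node(&cur, weights, nr_nodes)]++;
}
```

With weights {4, 1} over two nodes, 1000 allocations split 800/200.
With weights {1, 1} the sketch degenerates to the even round-robin of
the existing interleave policy and splits 500/500.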
>
> The benefit of the sysfs extension is that MPOL_WEIGHTED_INTERLEAVE
> can be used by the existing set_mempolicy and mbind via numactl.
>
> There are 3 "phases" in the patch set that could be considered
> for separate merge candidates, but are presented here as a single
> line as the goal is a fully functional MPOL_WEIGHTED_INTERLEAVE.
>
> 1) Implement MPOL_WEIGHTED_INTERLEAVE with a sysfs extension for
>    setting system-global weights via sysfs.
>    (Patches 1 & 2)
>
> 2) Refactor mempolicy creation mechanism to use an extensible arg
>    struct `struct mempolicy_args` to promote code re-use between
>    the original mempolicy/mbind interfaces and the new interfaces.
>    (Patches 3-6)
>
> 3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
>    along with the addition of task-local weights so that per-task
>    weights can be registered for MPOL_WEIGHTED_INTERLEAVE.
>    (Patches 7-11)
>
> Included at the bottom of this cover letter is linux test project
> tests for backward and forward compatibility, some sample software
> which can be used for quick tests, as well as a numactl branch
> which implements `numactl -w --interleave` for testing.
>
> = Performance summary =
> (tests may have different configurations, see extended info below)
> 1) MLC (W2) : +38% over DRAM. +264% over default interleave.
>    MLC (W5) : +40% over DRAM. +226% over default interleave.
> 2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
> 3) XSBench  : +19% over DRAM. +47% over default interleave.
>
> = LTP Testing Summary =
> existing mempolicy & mbind tests: pass
> mempolicy & mbind + weighted interleave (global weights): pass
> mempolicy2 & mbind2 + weighted interleave (global weights): pass
> mempolicy2 & mbind2 + weighted interleave (local weights): pass
>

[snip]

> =====================================================================
> (Patches 3-6) Refactoring mempolicy for code-reuse
>
> To avoid multiple paths of mempolicy creation, we should refactor the
> existing code to enable the designed extensibility, and refactor
> existing users to utilize the new interface (while retaining the
> existing userland interface).
>
> This set of patches introduces a new mempolicy_args structure, which
> is used to more fully describe a requested mempolicy - to include
> existing and future extensions.
>
> /*
>  * Describes settings of a mempolicy during set/get syscalls and
>  * kernel internal calls to do_set_mempolicy()
>  */
> struct mempolicy_args {
>         unsigned short mode;            /* policy mode */
>         unsigned short mode_flags;      /* policy mode flags */
>         int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
>         nodemask_t *policy_nodes;       /* get/set/mbind */
>         unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
> };

According to

https://www.geeksforgeeks.org/difference-between-argument-and-parameter-in-c-c-with-examples/

it appears that "parameter" is a better fit than "argument" for the
struct name here.  The current kernel source appears to support this too.

$ grep 'struct[\t ]\+[a-zA-Z0-9]\+_param' -r include/linux | wc -l
411
$ grep 'struct[\t ]\+[a-zA-Z0-9]\+_arg' -r include/linux | wc -l
25

> This arg structure will eventually be utilized by the following
> interfaces:
>     mpol_new() - new mempolicy creation
>     do_get_mempolicy() - acquiring information about mempolicy
>     do_set_mempolicy() - setting the task mempolicy
>     do_mbind() - setting a vma mempolicy
>
> do_get_mempolicy() is completely refactored to break it out into
> separate functionality based on the flags provided by get_mempolicy(2)
>     MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
>     MPOL_F_ADDR: acquires information on vma policies
>     MPOL_F_NODE: changes the output for the policy arg to node info
>
> We refactor the get_mempolicy syscall flatten the logic based on these
> flags, and aloow for set_mempolicy2() to re-use the underlying logic.
>
> The result of this refactor, and the new mempolicy_args structure, is
> that extensions like 'sys_set_mempolicy_home_node' can now be directly
> integrated into the initial call to 'set_mempolicy2', and that more
> complete information about a mempolicy can be returned with a single
> call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
>
>
> =====================================================================
> (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>
> These interfaces are the 'extended' counterpart to their relatives.
> They use the userland 'struct mpol_args' structure to communicate a
> complete mempolicy configuration to the kernel.  This structure
> looks very much like the kernel-internal 'struct mempolicy_args':
>
> struct mpol_args {
>         /* Basic mempolicy settings */
>         __u16 mode;
>         __u16 mode_flags;
>         __s32 home_node;
>         __u64 pol_maxnodes;

I understand that we want to avoid a hole in the struct.  But I still
feel uncomfortable using __u64 for such a small value.  I don't have a
better solution either.  Does anyone else have an idea?

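To make the hole concrete, here is a small user-space layout check.  It
is a sketch under assumptions, not code from the patches: the structs
use <stdint.h> stand-ins for the uapi types, `il_weights` is treated as
a plain 64-bit field, and `struct mpol_args_u16` is a hypothetical
variant that narrows `pol_maxnodes` to 16 bits.  The `aligned`
attribute is the GCC/Clang spelling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the uapi __aligned_u64 type (GCC/Clang attribute). */
typedef uint64_t __attribute__((aligned(8))) aligned_u64;

/* Layout as quoted above, with a 64-bit pol_maxnodes. */
struct mpol_args_u64 {
        uint16_t mode;
        uint16_t mode_flags;
        int32_t  home_node;
        uint64_t pol_maxnodes;
        aligned_u64 pol_nodes;
        aligned_u64 il_weights;
};

/* Hypothetical variant with a 16-bit pol_maxnodes. */
struct mpol_args_u16 {
        uint16_t mode;
        uint16_t mode_flags;
        int32_t  home_node;
        uint16_t pol_maxnodes;
        /* the compiler inserts 6 bytes of implicit padding here */
        aligned_u64 pol_nodes;
        aligned_u64 il_weights;
};

/* The 64-bit layout is fully packed: 2 + 2 + 4 + 8 + 8 + 8 bytes. */
_Static_assert(offsetof(struct mpol_args_u64, pol_nodes) == 16,
               "pol_nodes follows pol_maxnodes directly");
_Static_assert(sizeof(struct mpol_args_u64) == 32, "no padding");

/* The 16-bit layout is the same size, but has an unnamed 6-byte hole. */
_Static_assert(offsetof(struct mpol_args_u16, pol_maxnodes) == 8,
               "pol_maxnodes ends at byte 10");
_Static_assert(offsetof(struct mpol_args_u16, pol_nodes) == 16,
               "pol_nodes is pushed to the next 8-byte boundary");
_Static_assert(sizeof(struct mpol_args_u16) == 32, "same total size");
```

Narrowing the field therefore saves no space by itself; it only turns
six defined bytes into implicit padding.
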
>         __aligned_u64 pol_nodes;
>         __aligned_u64 *il_weights;      /* of size pol_maxnodes */

Typo?  Should it be

        __aligned_u64 il_weights;       /* of size pol_maxnodes */

?  I found this in some patch descriptions too.

> };
>
> The basic mempolicy settings which are shared across all interfaces
> are captured at the top of the structure, while extensions such as
> 'policy_node' and 'addr' are collected beneath.
>
> The syscalls are uniform and defined as follows:
>
> long sys_mbind2(unsigned long addr, unsigned long len,
>                 struct mpol_args *args, size_t usize,
>                 unsigned long flags);
>
> long sys_get_mempolicy2(struct mpol_args *args, size_t size,
>                         unsigned long addr, unsigned long flags);
>
> long sys_set_mempolicy2(struct mpol_args *args, size_t size,
>                         unsigned long flags);
>
> The 'flags' argument for mbind2 is the same as 'mbind', except with
> the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
> field should be utilized.
>
> The 'flags' argument for get_mempolicy2 allows for MPOL_F_ADDR to
> allow operating on VMA policies, but MPOL_F_NODE and MPOL_F_MEMS_ALLOWED
> behavior has been omitted, since get_mempolicy() provides this already.

I still think that it's a good idea to make it possible to deprecate
get_mempolicy().  How about using a union as follows?

struct mpol_mems_allowed {
        __u64 maxnodes;
        __aligned_u64 nodemask;
};

union mpol_info {
        struct mpol_args args;
        struct mpol_mems_allowed mems_allowed;
        __s32 node;
};

> The 'flags' argument is not used by 'set_mempolicy' at this time, but
> may end up allowing the use of MPOL_MF_HOME_NODE if such functionality
> is desired.
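
To illustrate the union suggestion above, the following user-space
sketch shows how the get_mempolicy2 'flags' could select the valid
union member.  It is only an illustration: `enum mpol_info_kind` and
`mpol_info_select()` are hypothetical, the structs use <stdint.h>
stand-ins for the uapi types, `il_weights` is treated as a plain 64-bit
field, and rejecting invalid flag combinations is left out.  The flag
values match the existing uapi <linux/mempolicy.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in the existing uapi <linux/mempolicy.h>. */
#define MPOL_F_NODE             (1 << 0)
#define MPOL_F_ADDR             (1 << 1)
#define MPOL_F_MEMS_ALLOWED     (1 << 2)

/* Stand-in for the uapi __aligned_u64 type (GCC/Clang attribute). */
typedef uint64_t __attribute__((aligned(8))) aligned_u64;

struct mpol_args {
        uint16_t mode;
        uint16_t mode_flags;
        int32_t  home_node;
        uint64_t pol_maxnodes;
        aligned_u64 pol_nodes;
        aligned_u64 il_weights;
};

struct mpol_mems_allowed {
        uint64_t maxnodes;
        aligned_u64 nodemask;
};

union mpol_info {
        struct mpol_args args;
        struct mpol_mems_allowed mems_allowed;
        int32_t node;
};

/* Hypothetical tag naming the union member the kernel would fill in. */
enum mpol_info_kind {
        MPOL_INFO_ARGS,
        MPOL_INFO_MEMS_ALLOWED,
        MPOL_INFO_NODE,
};

/* Map the syscall flags to the union member that would be valid. */
static enum mpol_info_kind mpol_info_select(unsigned long flags)
{
        if (flags & MPOL_F_MEMS_ALLOWED)
                return MPOL_INFO_MEMS_ALLOWED;
        if (flags & MPOL_F_NODE)
                return MPOL_INFO_NODE;
        return MPOL_INFO_ARGS;  /* with or without MPOL_F_ADDR */
}

/* The largest member is struct mpol_args, so the union adds no size. */
_Static_assert(sizeof(union mpol_info) == sizeof(struct mpol_args),
               "union is no larger than struct mpol_args");
```

In this sketch the 'size' argument of get_mempolicy2 would carry
sizeof(union mpol_info), which equals sizeof(struct mpol_args).
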
>
> The extensions can be summed up as follows:
>
> get_mempolicy2 extensions:
>     - mode and mode flags are split into separate fields
>     - MPOL_F_MEMS_ALLOWED and MPOL_F_NODE are not supported
>
> set_mempolicy2:
>     - task-local interleave weights can be set via 'il_weights'
>
> mbind2:
>     - home_node field sets policy home node w/ MPOL_MF_HOME_NODE
>     - task-local interleave weights can be set via 'il_weights'
>

--
Best Regards,
Huang, Ying

[snip]