On Mon, Dec 11, 2023 at 01:53:40PM +0800, Huang, Ying wrote: > Hi, Gregory, > > Thanks for updated version! > > Gregory Price <gourry.memverge@xxxxxxxxx> writes: > > > v2: > > changes / adds: > > - flattened weight matrix to an array at requested of Ying Huang > > - Updated ABI docs per Davidlohr Bueso request > > - change uapi structure to use aligned/fixed-length members as > > Suggested-by: Arnd Bergmann <arnd@xxxxxxxx> > > - Implemented weight fetch logic in get_mempolicy2 > > - mbind2 was changed to take (iovec,len) as function arguments > > rather than add them to the uapi structure, since they describe > > where to apply the mempolicy - as opposed to being part of it. > > > > The sysfs structure is designed as follows. > > > > $ tree /sys/kernel/mm/mempolicy/ > > /sys/kernel/mm/mempolicy/ > > ├── possible_nodes > > └── weighted_interleave > > ├── nodeN > > │ └── weight > > └── nodeN+X > > └── weight > > > > 'mempolicy' is added to '/sys/kernel/mm/' as a control group for > > the mempolicy subsystem. > > Is it good to add 'mempolicy' in '/sys/kernel/mm/numa'? The advantage > is that 'mempolicy' here is in fact "NUMA mempolicy". The disadvantage > is one more directory nesting. I have no strong opinion here. > i don't have a strong opinion here. > > 'possible_nodes' is added to 'mm/mempolicy' to help describe the > > expected structures under mempolicy directorys. For example, > > possible_nodes describes what nodeN directories wille exist under > > the weighted_interleave directory. > > We have '/sys/devices/system/node/possible' already. Is this just a > duplication? If so, why? And, the possible nodes can be gotten via > contents of 'weighted_interleave' too. > I'll remove it > And it appears not necessary to make 'weighted_interleave/nodeN' > directory. Why not just make it a file. > Originally I wasn't sure whether there would be more attributes, but this is probably fine. I'll change it. > And, can we add a way to reset weight to the default value? For example > `echo > nodeN/weight` or `echo > nodeN`. > Seems reasonable. > > ===================================================================== > > (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2 > > > > These interfaces are the 'extended' counterpart to their relatives. > > They use the userland 'struct mpol_args' structure to communicate a > > complete mempolicy configuration to the kernel. This structure > > looks very much like the kernel-internal 'struct mempolicy_args': > > > > struct mpol_args { > > /* Basic mempolicy settings */ > > __u16 mode; > > __u16 mode_flags; > > __s32 home_node; > > __aligned_u64 pol_nodes; > > __u64 pol_maxnodes; > > __u64 addr; > > __s32 policy_node; > > __s32 addr_node; > > __aligned_u64 *il_weights; /* of size pol_maxnodes */ > > }; > > This looks unnecessarily complex. I don't think that it's a good idea > to use exact same parameter for all 3 syscalls. > It is exactly as complex as mempolicy is. Everything here is already described in the existing interfaces (except il_weights). > For example, can we use something as below? > > long set_mempolicy2(int mode, const unsigned long *nodemask, unsigned int *il_weights, > unsigned long maxnode, unsigned long home_node, > unsigned long flags); > > long mbind2(unsigned long start, unsigned long len, > int mode, const unsigned long *nodemask, unsigned int *il_weights, > unsigned long maxnode, unsigned long home_node, > unsigned long flags); > Your definition of mbind2 is impossible. Neither of these interfaces solve the extensibility issue. If a new policy which requires a new format of data arrives, we can look forward to set_mempolicy3 and mbind3. > A struct may be defined to hold mempolicy iteself. > > struct mpol { > int mode; > unsigned int home_node; > const unsigned long *nodemask; > unsigned int *il_weights; > unsigned int maxnode; > }; > addr could be pulled out for get_mempolicy2, so i will do that 'addr_node' and 'policy_node' are warts that came from the original get_mempolicy. Removing them increases the complexity of handling arguments in the common get_mempolicy code. I could probably just drop support for retrieving the addr_node from get_mempolicy2, since it's already possible with get_mempolicy. So I will do that. ~Gregory