Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> writes: > On 8/1/22 10:40 AM, Huang, Ying wrote: >> Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> writes: >> >>> On 8/1/22 7:36 AM, Huang, Ying wrote: >>>> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >>>> >>>>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes: >>>>> >>>>>> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >>>>>> >>>>>>> By default, all nodes are assigned to the default memory tier which >>>>>>> is the memory tier designated for nodes with DRAM >>>>>>> >>>>>>> Set dax kmem device node's tier to slower memory tier by assigning >>>>>>> abstract distance to MEMTIER_ADISTANCE_PMEM. PMEM tier >>>>>>> appears below the default memory tier in demotion order. >>>>>>> >>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx> >>>>>>> --- >>>>>>> drivers/dax/kmem.c | 9 +++++++++ >>>>>>> include/linux/memory-tiers.h | 19 ++++++++++++++++++- >>>>>>> mm/memory-tiers.c | 28 ++++++++++++++++------------ >>>>>>> 3 files changed, 43 insertions(+), 13 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c >>>>>>> index a37622060fff..6b0d5de9a3e9 100644 >>>>>>> --- a/drivers/dax/kmem.c >>>>>>> +++ b/drivers/dax/kmem.c >>>>>>> @@ -11,6 +11,7 @@ >>>>>>> #include <linux/fs.h> >>>>>>> #include <linux/mm.h> >>>>>>> #include <linux/mman.h> >>>>>>> +#include <linux/memory-tiers.h> >>>>>>> #include "dax-private.h" >>>>>>> #include "bus.h" >>>>>>> >>>>>>> @@ -41,6 +42,12 @@ struct dax_kmem_data { >>>>>>> struct resource *res[]; >>>>>>> }; >>>>>>> >>>>>>> +static struct memory_dev_type default_pmem_type = { >>>>>> >>>>>> Why is this named as default_pmem_type? We will not change the memory >>>>>> type of a node usually. >>>>>> >>>>> >>>>> Any other suggestion? pmem_dev_type? >>>> >>>> Or dax_pmem_type? >>>> >>>> DAX is used to enumerate the memory device. >>>> >>>>> >>>>>>> + .adistance = MEMTIER_ADISTANCE_PMEM, >>>>>>> + .tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling), >>>>>>> + .nodes = NODE_MASK_NONE, >>>>>>> +}; >>>>>>> + >>>>>>> static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>>>>>> { >>>>>>> struct device *dev = &dev_dax->dev; >>>>>>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>>>>>> return -EINVAL; >>>>>>> } >>>>>>> >>>>>>> + init_node_memory_type(numa_node, &default_pmem_type); >>>>>>> + >>>>>> >>>>>> The memory hot-add below may fail. So the error handling needs to be >>>>>> added. >>>>>> >>>>>> And, it appears that the memory type and memory tier of a node may be >>>>>> fully initialized here before NUMA hot-adding started. So I suggest to >>>>>> set node_memory_types[] here only. And set memory_dev_type->nodes in >>>>>> node hot-add callback. I think there is the proper place to complete >>>>>> the initialization. >>>>>> >>>>>> And, in theory dax/kmem.c can be unloaded. So we need to clear >>>>>> node_memory_types[] for nodes somewhere. >>>>>> >>>>> >>>>> I guess by module exit we can be sure that all the memory managed >>>>> by dax/kmem is hotplugged out. How about something like below? >>>> >>>> Because we set node_memorty_types[] in dev_dax_kmem_probe(), it's >>>> natural to clear it in dev_dax_kmem_remove(). >>>> >>> >>> Most of required reset/clear is done as part of memory hotunplug. So >>> if we did manage to successfully unplug the memory, everything except >>> node_memory_types[node] should be reset. That makes the clear_node_memory_type >>> the below. >>> >>> void clear_node_memory_type(int node, struct memory_dev_type *memtype) >>> { >>> >>> mutex_lock(&memory_tier_lock); >>> /* >>> * memory unplug did clear the node from the memtype and >>> * dax/kem did initialize this node's memory type. >>> */ >>> if (!node_isset(node, memtype->nodes) && node_memory_types[node] == memtype){ >>> node_memory_types[node] = NULL; >>> } >>> mutex_unlock(&memory_tier_lock); >>> } >>> >>> With the module unload, it is kind of force removing the usage of the specific memtype. >>> Considering module unload will remove the usage of specific memtype from other parts >>> of the kernel and we already do all the required reset in memory hot unplug, do we >>> need to do the clear_node_memory_type above? >> >> Per my understanding, we need to call clear_node_memory_type() in >> dev_dax_kmem_remove(). After that, we have nothing to do in >> dax_kmem_exit(). >> > > Ok, I guess you are suggesting to do the clear_node_memory_type even if we fail the memory remove. Can we use node_memory_types[] to indicate whether a node is managed by a driver? Regardless being succeeded or failed, dev_dax_kmem_remove() will set node_memory_types[] = NULL. But until node is offlined, we will still keep the node in the memory_dev_type (dax_pmem_type). And we will prevent dax/kmem from unloading via try_module_get() and add "struct module *" to struct memory_dev_type. Best Regards, Huang, Ying > Should we also rebuild demotion order? On a successful memory remove we do rebuild demotion order. > This is what i ended up with. > > modified drivers/dax/kmem.c > @@ -171,6 +171,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > static void dev_dax_kmem_remove(struct dev_dax *dev_dax) > { > int i, success = 0; > + int node = dev_dax->target_node; > struct device *dev = &dev_dax->dev; > struct dax_kmem_data *data = dev_get_drvdata(dev); > > @@ -208,6 +209,12 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax) > kfree(data); > dev_set_drvdata(dev, NULL); > } > + /* > + * Clear the memtype association, even if the memory > + * remove failed. > + */ > + clear_node_memory_type(node, dax_pmem_type); > + > } > #else > static void dev_dax_kmem_remove(struct dev_dax *dev_dax) > modified include/linux/memory-tiers.h > @@ -31,6 +31,7 @@ struct memory_dev_type { > #ifdef CONFIG_NUMA > extern bool numa_demotion_enabled; > void init_node_memory_type(int node, struct memory_dev_type *default_type); > +void clear_node_memory_type(int node, struct memory_dev_type *memtype); > #ifdef CONFIG_MIGRATION > int next_demotion_node(int node); > void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); > @@ -57,6 +58,10 @@ static inline bool node_is_toptier(int node) > #define numa_demotion_enabled false > static inline void init_node_memory_type(int node, struct memory_dev_type *default_type) > { > +} > + > +static inline void unregister_memory_type(struct memory_dev_type *memtype) > +{ > > } > > modified mm/memory-tiers.c > @@ -501,6 +501,36 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type) > } > EXPORT_SYMBOL_GPL(init_node_memory_type); > > +void clear_node_memory_type(int node, struct memory_dev_type *memtype) > +{ > + struct memory_tier *memtier; > + > + mutex_lock(&memory_tier_lock); > + /* > + * Even if we fail to unplug memory, clear the association of > + * this node to this specific memory type. > + */ > + if (node_memory_types[node] == memtype) { > + > + memtier = __node_get_memory_tier(node); > + if (memtier) { > + rcu_assign_pointer(pgdat->memtier, NULL); > + synchronize_rcu(); > + } > + node_clear(node, memtype->nodes); > + if (nodes_empty(memtype->nodes)) { > + list_del(&memtype->tier_sibiling); > + memtype->memtier = NULL; > + if (current_memtier && list_empty(¤t_memtier->memory_types)) > + destroy_memory_tier(current_memtier); > + > + } > + node_memory_types[node] = NULL; > + } > + mutex_unlock(&memory_tier_lock); > +} > +EXPORT_SYMBOL_GPL(init_node_memory_type); > + > void update_node_adistance(int node, struct memory_dev_type *memtype) > { > pg_data_t *pgdat; > > [back