On 8/1/22 10:40 AM, Huang, Ying wrote: > Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> writes: > >> On 8/1/22 7:36 AM, Huang, Ying wrote: >>> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >>> >>>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes: >>>> >>>>> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >>>>> >>>>>> By default, all nodes are assigned to the default memory tier which >>>>>> is the memory tier designated for nodes with DRAM >>>>>> >>>>>> Set dax kmem device node's tier to slower memory tier by assigning >>>>>> abstract distance to MEMTIER_ADISTANCE_PMEM. PMEM tier >>>>>> appears below the default memory tier in demotion order. >>>>>> >>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx> >>>>>> --- >>>>>> drivers/dax/kmem.c | 9 +++++++++ >>>>>> include/linux/memory-tiers.h | 19 ++++++++++++++++++- >>>>>> mm/memory-tiers.c | 28 ++++++++++++++++------------ >>>>>> 3 files changed, 43 insertions(+), 13 deletions(-) >>>>>> >>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c >>>>>> index a37622060fff..6b0d5de9a3e9 100644 >>>>>> --- a/drivers/dax/kmem.c >>>>>> +++ b/drivers/dax/kmem.c >>>>>> @@ -11,6 +11,7 @@ >>>>>> #include <linux/fs.h> >>>>>> #include <linux/mm.h> >>>>>> #include <linux/mman.h> >>>>>> +#include <linux/memory-tiers.h> >>>>>> #include "dax-private.h" >>>>>> #include "bus.h" >>>>>> >>>>>> @@ -41,6 +42,12 @@ struct dax_kmem_data { >>>>>> struct resource *res[]; >>>>>> }; >>>>>> >>>>>> +static struct memory_dev_type default_pmem_type = { >>>>> >>>>> Why is this named as default_pmem_type? We will not change the memory >>>>> type of a node usually. >>>>> >>>> >>>> Any other suggestion? pmem_dev_type? >>> >>> Or dax_pmem_type? >>> >>> DAX is used to enumerate the memory device. >>> >>>> >>>>>> + .adistance = MEMTIER_ADISTANCE_PMEM, >>>>>> + .tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling), >>>>>> + .nodes = NODE_MASK_NONE, >>>>>> +}; >>>>>> + >>>>>> static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>>>>> { >>>>>> struct device *dev = &dev_dax->dev; >>>>>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>>>>> return -EINVAL; >>>>>> } >>>>>> >>>>>> + init_node_memory_type(numa_node, &default_pmem_type); >>>>>> + >>>>> >>>>> The memory hot-add below may fail. So the error handling needs to be >>>>> added. >>>>> >>>>> And, it appears that the memory type and memory tier of a node may be >>>>> fully initialized here before NUMA hot-adding started. So I suggest to >>>>> set node_memory_types[] here only. And set memory_dev_type->nodes in >>>>> node hot-add callback. I think there is the proper place to complete >>>>> the initialization. >>>>> >>>>> And, in theory dax/kmem.c can be unloaded. So we need to clear >>>>> node_memory_types[] for nodes somewhere. >>>>> >>>> >>>> I guess by module exit we can be sure that all the memory managed >>>> by dax/kmem is hotplugged out. How about something like below? >>> >>> Because we set node_memorty_types[] in dev_dax_kmem_probe(), it's >>> natural to clear it in dev_dax_kmem_remove(). >>> >> >> Most of required reset/clear is done as part of memory hotunplug. So >> if we did manage to successfully unplug the memory, everything except >> node_memory_types[node] should be reset. That makes the clear_node_memory_type >> the below. >> >> void clear_node_memory_type(int node, struct memory_dev_type *memtype) >> { >> >> mutex_lock(&memory_tier_lock); >> /* >> * memory unplug did clear the node from the memtype and >> * dax/kem did initialize this node's memory type. >> */ >> if (!node_isset(node, memtype->nodes) && node_memory_types[node] == memtype){ >> node_memory_types[node] = NULL; >> } >> mutex_unlock(&memory_tier_lock); >> } >> >> With the module unload, it is kind of force removing the usage of the specific memtype. >> Considering module unload will remove the usage of specific memtype from other parts >> of the kernel and we already do all the required reset in memory hot unplug, do we >> need to do the clear_node_memory_type above? > > Per my understanding, we need to call clear_node_memory_type() in > dev_dax_kmem_remove(). After that, we have nothing to do in > dax_kmem_exit(). > Ok, I guess you are suggesting to do the clear_node_memory_type even if we fail the memory remove. Should we also rebuild demotion order? On a successful memory remove we do rebuild demotion order. This is what i ended up with. modified drivers/dax/kmem.c @@ -171,6 +171,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) static void dev_dax_kmem_remove(struct dev_dax *dev_dax) { int i, success = 0; + int node = dev_dax->target_node; struct device *dev = &dev_dax->dev; struct dax_kmem_data *data = dev_get_drvdata(dev); @@ -208,6 +209,12 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax) kfree(data); dev_set_drvdata(dev, NULL); } + /* + * Clear the memtype association, even if the memory + * remove failed. + */ + clear_node_memory_type(node, dax_pmem_type); + } #else static void dev_dax_kmem_remove(struct dev_dax *dev_dax) modified include/linux/memory-tiers.h @@ -31,6 +31,7 @@ struct memory_dev_type { #ifdef CONFIG_NUMA extern bool numa_demotion_enabled; void init_node_memory_type(int node, struct memory_dev_type *default_type); +void clear_node_memory_type(int node, struct memory_dev_type *memtype); #ifdef CONFIG_MIGRATION int next_demotion_node(int node); void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); @@ -57,6 +58,10 @@ static inline bool node_is_toptier(int node) #define numa_demotion_enabled false static inline void init_node_memory_type(int node, struct memory_dev_type *default_type) { +} + +static inline void unregister_memory_type(struct memory_dev_type *memtype) +{ } modified mm/memory-tiers.c @@ -501,6 +501,36 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type) } EXPORT_SYMBOL_GPL(init_node_memory_type); +void clear_node_memory_type(int node, struct memory_dev_type *memtype) +{ + struct memory_tier *memtier; + + mutex_lock(&memory_tier_lock); + /* + * Even if we fail to unplug memory, clear the association of + * this node to this specific memory type. + */ + if (node_memory_types[node] == memtype) { + + memtier = __node_get_memory_tier(node); + if (memtier) { + rcu_assign_pointer(pgdat->memtier, NULL); + synchronize_rcu(); + } + node_clear(node, memtype->nodes); + if (nodes_empty(memtype->nodes)) { + list_del(&memtype->tier_sibiling); + memtype->memtier = NULL; + if (current_memtier && list_empty(¤t_memtier->memory_types)) + destroy_memory_tier(current_memtier); + + } + node_memory_types[node] = NULL; + } + mutex_unlock(&memory_tier_lock); +} +EXPORT_SYMBOL_GPL(init_node_memory_type); + void update_node_adistance(int node, struct memory_dev_type *memtype) { pg_data_t *pgdat; [back