On 7/18/22 12:27 PM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> writes:
>
>> On 7/15/22 1:23 PM, Huang, Ying wrote:
>
> [snip]
>
>>>
>>> You dropped the original sysfs interface patches from the series, but
>>> the kernel internal implementation is still for the original sysfs
>>> interface. For example, memory tier ID is for the original sysfs
>>> interface, not for the new proposed sysfs interface. So I suggest you
>>> to implement with the new interface in mind. What do you think about
>>> the following design?
>>>
>>
>> Sorry, I am not able to follow you here. This patchset completely drops
>> exposing memory tiers to userspace via sysfs. Instead it allows
>> creation of memory tiers with a specific tierID from within the
>> kernel/device driver. Default tierID is 200 and dax kmem creates a
>> memory tier with tierID 100.
>>
>>
>>> - Each NUMA node belongs to a memory type, and each memory type
>>>   corresponds to an "abstract distance", so each NUMA node corresponds
>>>   to a "distance". For simplicity, we can start with static distances,
>>>   for example, DRAM (default): 150, PMEM: 250. The distance of each
>>>   NUMA node can be recorded in a global array,
>>>
>>>     int node_distances[MAX_NUMNODES];
>>>
>>>   or, just
>>>
>>>     pgdat->distance
>>>
>>
>> I don't follow this. I guess you are trying to have a different design.
>> Would it be much easier if you could write this in the form of a patch?
>
> Written some pseudo code as follows to show my basic idea.
>
> #define MEMORY_TIER_ADISTANCE_DRAM	150
> #define MEMORY_TIER_ADISTANCE_PMEM	250
>
> struct memory_tier {
> 	/* abstract distance range covered by the memory tier */
> 	int adistance_start;
> 	int adistance_len;
> 	struct list_head list;
> 	nodemask_t nodemask;
> };
>
> /* RCU list of memory tiers */
> static LIST_HEAD(memory_tiers);
>
> /* abstract distance of each NUMA node */
> int node_adistances[MAX_NUMNODES];
>
> struct memory_tier *find_create_memory_tier(int adistance)
> {
> 	struct memory_tier *tier;
>
> 	list_for_each_entry(tier, &memory_tiers, list) {
> 		if (adistance >= tier->adistance_start &&
> 		    adistance < tier->adistance_start + tier->adistance_len)
> 			return tier;
> 	}
> 	/* allocate a new memory tier and return */
> }
>
> void memory_tier_add_node(int nid)
> {
> 	int adistance;
> 	struct memory_tier *tier;
>
> 	adistance = node_adistances[nid] || MEMORY_TIER_ADISTANCE_DRAM;
> 	tier = find_create_memory_tier(adistance);
> 	node_set(nid, &tier->nodemask);
> 	/* setup demotion data structure, etc */
> }
>
> static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> 						 unsigned long action, void *_arg)
> {
> 	struct memory_notify *arg = _arg;
> 	int nid;
>
> 	nid = arg->status_change_nid;
> 	if (nid < 0)
> 		return notifier_from_errno(0);
>
> 	switch (action) {
> 	case MEM_ONLINE:
> 		memory_tier_add_node(nid);
> 		break;
> 	}
>
> 	return notifier_from_errno(0);
> }
>
> /* kmem.c */
> static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> {
> 	node_adistances[dev_dax->target_node] = MEMORY_TIER_ADISTANCE_PMEM;
> 	/* add_memory_driver_managed() */
> }
>
> [snip]
>
> Best Regards,
> Huang, Ying

Implementing that, I ended up with the code below. The difference is that
adistance_len is not a memory tier property; instead it is a kernel
parameter (memtier_perf_chunk_size in the code below) which can be tuned
to create more memory tiers.

How about this? Not yet tested.
struct memory_tier {
	struct list_head list;
	int id;
	int perf_level;
	nodemask_t nodelist;
};

static LIST_HEAD(memory_tiers);
static DEFINE_MUTEX(memory_tier_lock);

static unsigned int default_memtier_perf_level = DEFAULT_MEMORY_TYPE_PERF;
core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);
static unsigned int memtier_perf_chunk_size = 150;
core_param(memory_tier_perf_chunk, memtier_perf_chunk_size, uint, 0644);

/*
 * Performance levels are grouped into memtiers, each covering a
 * perf level range of memtier_perf_chunk_size.
 */
static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
{
	bool found_slot = false;
	struct list_head *ent;
	struct memory_tier *memtier, *new_memtier;
	static int next_memtier_id;

	/*
	 * Zero is special: it indicates a perf level left uninitialized
	 * by the respective driver. Use the default memory tier perf
	 * level in that case.
	 */
	if (!perf_level)
		perf_level = default_memtier_perf_level;

	lockdep_assert_held_once(&memory_tier_lock);

	list_for_each(ent, &memory_tiers) {
		memtier = list_entry(ent, struct memory_tier, list);
		if (perf_level >= memtier->perf_level &&
		    perf_level < memtier->perf_level + memtier_perf_chunk_size)
			return memtier;
		else if (perf_level < memtier->perf_level) {
			found_slot = true;
			break;
		}
	}

	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
	if (!new_memtier)
		return ERR_PTR(-ENOMEM);

	new_memtier->id = next_memtier_id++;
	new_memtier->perf_level = ALIGN_DOWN(perf_level, memtier_perf_chunk_size);
	if (found_slot)
		list_add_tail(&new_memtier->list, ent);
	else
		list_add_tail(&new_memtier->list, &memory_tiers);
	return new_memtier;
}

static int __init memory_tier_init(void)
{
	int node;
	struct memory_tier *memtier;

	/*
	 * Since this is early during boot, we could avoid holding
	 * memory_tier_lock. But keep it simple by holding the lock
	 * anyway, so that lock-held debug checks can be added in
	 * other functions.
	 */
	mutex_lock(&memory_tier_lock);
	memtier = find_create_memory_tier(default_memtier_perf_level);
	if (IS_ERR(memtier))
		panic("%s() failed to register memory tier: %ld\n",
		      __func__, PTR_ERR(memtier));

	/* CPU-only nodes are not part of memory tiers. */
	memtier->nodelist = node_states[N_MEMORY];

	/*
	 * Nodes that are already online and don't have a perf level
	 * assigned get the default perf level.
	 */
	for_each_node_state(node, N_MEMORY) {
		struct node *node_property = node_devices[node];

		if (!node_property->perf_level)
			node_property->perf_level = default_memtier_perf_level;
	}
	mutex_unlock(&memory_tier_lock);
	return 0;
}
subsys_initcall(memory_tier_init);
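
For completeness, here is a rough (untested) sketch of how the consumer
side could look, mirroring the pseudo code quoted above but keyed on the
per-node perf level instead of an abstract distance. The value of
MEMTIER_PERF_LEVEL_PMEM is only illustrative, and this ignores the
question of whether node_devices[nid] is already registered when the
driver probes:

/*
 * Rough sketch only, not tested: how the dax/kmem driver and the memory
 * hotplug path could consume find_create_memory_tier(). The PMEM perf
 * level below is an illustrative assumption; any value outside the
 * default tier's chunk ends up in a separate tier.
 */
#define MEMTIER_PERF_LEVEL_PMEM	300

/* kmem.c: record the dax node's perf level before onlining its memory */
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
	node_devices[dev_dax->target_node]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
	/* add_memory_driver_managed() */
	return 0;
}

/* memory hotplug (MEM_ONLINE): place the node into the matching memtier */
static void memory_tier_add_node(int nid)
{
	struct memory_tier *memtier;

	mutex_lock(&memory_tier_lock);
	memtier = find_create_memory_tier(node_devices[nid]->perf_level);
	if (!IS_ERR(memtier))
		node_set(nid, memtier->nodelist);
	mutex_unlock(&memory_tier_lock);
}

Since find_create_memory_tier() aligns a new tier's perf_level down to a
multiple of memtier_perf_chunk_size, all nodes whose perf levels fall in
the same chunk-sized window (150 wide by default) share one memory tier,
and tuning memory_tier_perf_chunk on the kernel command line makes that
grouping coarser or finer.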