On 2015/8/4 11:36, Tang Chen wrote:
> Hi TJ,
>
> Sorry for the late reply.
>
> On 07/16/2015 05:48 AM, Tejun Heo wrote:
>>> ......
>>> so in initialization phase makes no sense any more. The best near
>>> online node for each cpu should be cached somewhere.
>>
>> I'm not really following. Is this because the now offline node can
>> later come online and we'd have to break the constant mapping
>> invariant if we update the mapping later? If so, it'd be nice to
>> spell that out.
>
> Yes. Will document this in the next version.
>
>>> ......
>>> +int get_near_online_node(int node)
>>> +{
>>> +	return per_cpu(x86_cpu_to_near_online_node,
>>> +		       cpumask_first(&node_to_cpuid_mask_map[node]));
>>> +}
>>> +EXPORT_SYMBOL(get_near_online_node);
>>
>> Umm... this function is sitting on a fairly hot path and scanning a
>> cpumask each time. Why not just build a numa node -> numa node array?
>
> Indeed. Will avoid scanning a cpumask.
>
>> ......
>>
>>>  static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>>>  					unsigned int order)
>>>  {
>>> -	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>>> +	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>>> +
>>> +#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
>>> +	if (!node_online(nid))
>>> +		nid = get_near_online_node(nid);
>>> +#endif
>>>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>>  }
>>
>> Ditto. Also, what are the synchronization rules for NUMA node
>> on/offlining? If you end up updating the mapping later, how would
>> that be synchronized against the above usages?
>
> I think the near online node map should be updated when node online/offline
> happens. But about this, I think the current numa code has a little problem.
>
> As you know, firmware info binds a set of CPUs and memory to a node. But
> at boot time, if the node has no memory (a memory-less node), it won't be
> online. The CPUs on that node are available, though, and are bound to the
> nearest online node. (Here, I mean numa_set_node(cpu, node).)
>
> Why does the kernel do this? I think it is to ensure that we can allocate
> memory successfully by calling functions like alloc_pages_node() and
> alloc_pages_exact_node(). With these two functions, any CPU should be
> bound to a node that has memory, so that memory allocation can succeed.
>
> That means, for a memory-less node at boot time, the CPUs on the node are
> online, but the node is not online.
>
> That also means "the node is online" equals "the node has memory".
> Actually, a lot of code in the kernel relies on this rule.
>
> But,
> 1) in cpu_up(), it will try to online a node, and it doesn't check if
>    the node has memory.
> 2) in try_offline_node(), it offlines the CPUs first, and then the memory.
>
> This behavior looks a little weird, or let's say it is ambiguous. It seems
> that a NUMA node consists of CPUs and memory. So if the CPUs are online,
> the node should be online.

Hi Chen,
	I have posted a patch set to enable memoryless nodes on x86, and
will repost it for review :) Hope it helps to solve this issue.
Thanks!
Gerry

>
> And also,
> the main purpose of this patch-set is to make the cpuid <-> nodeid
> mapping persistent. After this patch-set, alloc_pages_node() and
> alloc_pages_exact_node() won't depend on the cpuid <-> nodeid mapping
> any more. So the node should be online if the CPUs on it are online.
> Otherwise, we cannot set up the interfaces of the CPUs under /sys.
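
For the numa node -> numa node array Tejun suggested above, maybe something
like the sketch below would do. This is only a rough illustration of the
idea -- the names (near_online_node_map, rebuild_near_online_node_map) are
made up here, not taken from the actual patch:

#include <linux/kernel.h>
#include <linux/numa.h>
#include <linux/nodemask.h>
#include <linux/topology.h>
#include <linux/cache.h>
#include <linux/export.h>

/* For each node (online or not), cache the nearest node that has memory. */
static int near_online_node_map[MAX_NUMNODES] __read_mostly;

/* O(1) lookup, no cpumask scanning on the allocation hot path. */
int get_near_online_node(int node)
{
	return near_online_node_map[node];
}
EXPORT_SYMBOL(get_near_online_node);

/*
 * Rebuild the cache from the SLIT distances. Called at boot and whenever
 * a node goes online/offline.
 */
static void rebuild_near_online_node_map(void)
{
	int nid, onid, best_nid, best_dist, dist;

	for_each_node(nid) {
		if (node_online(nid)) {
			near_online_node_map[nid] = nid;
			continue;
		}

		best_nid = first_online_node;
		best_dist = INT_MAX;
		for_each_online_node(onid) {
			dist = node_distance(nid, onid);
			if (dist < best_dist) {
				best_dist = dist;
				best_nid = onid;
			}
		}
		near_online_node_map[nid] = best_nid;
	}
}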
>
> Unfortunately, since I don't have a machine with a memory-less node, I
> cannot reproduce the problem right now.
>
> How do you think the node online behavior should be changed?
>
> Thanks.
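
And regarding Tejun's synchronization question: if the mapping is cached in
a per-node array like the sketch above, one possible way to keep it up to
date is a memory hotplug notifier, roughly as below. Again, this is only a
sketch -- it assumes the hypothetical rebuild_near_online_node_map() from
the previous snippet, and protecting the array against concurrent lookups
during a rebuild still needs more thought:

#include <linux/memory.h>
#include <linux/notifier.h>
#include <linux/init.h>

/* Refresh the cached nearest-node table when node memory goes on/offline. */
static int near_node_memory_callback(struct notifier_block *self,
				     unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;

	switch (action) {
	case MEM_ONLINE:
	case MEM_OFFLINE:
		/* Only rebuild when a node actually changed state. */
		if (mn->status_change_nid >= 0)
			rebuild_near_online_node_map();
		break;
	default:
		break;
	}

	return NOTIFY_OK;
}

static struct notifier_block near_node_memory_nb = {
	.notifier_call = near_node_memory_callback,
};

static int __init near_node_map_init(void)
{
	rebuild_near_online_node_map();
	register_memory_notifier(&near_node_memory_nb);
	return 0;
}
device_initcall(near_node_map_init);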