On Tue, Dec 22, 2015 at 03:04:48PM +0530, Ganapatrao Kulkarni wrote:
> On Fri, Dec 18, 2015 at 12:00 AM, Ganapatrao Kulkarni
> <gpkulkarni@xxxxxxxxx> wrote:
> > On Thu, Dec 17, 2015 at 10:41 PM, Will Deacon <will.deacon@xxxxxxx> wrote:
> >> This all looks pretty reasonable, but I'd like to see an Ack from a
> >> devicetree maintainer on the binding before I merge anything (and I see
> >> that there are outstanding comments from Rutland on that).
> > IIUC, there are no open comments for the binding.
> > Mark Rutland: please let me know if there are any open comments;
> > otherwise, can you please Ack the binding.
> >>
> >> On Tue, Nov 17, 2015 at 10:50:40PM +0530, Ganapatrao Kulkarni wrote:
> >>> Adding numa support for arm64 based platforms.
> >>> This patch adds by default the dummy numa node and
> >>> maps all memory and cpus to node 0.
> >>> using this patch, numa can be simulated on single node arm64 platforms.
> >>>
> >>> Reviewed-by: Robert Richter <rrichter@xxxxxxxxxx>
> >>> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@xxxxxxxxxxxxxxxxxx>
> >>> ---
> >>>  arch/arm64/Kconfig              |  25 +++
> >>>  arch/arm64/include/asm/mmzone.h |  17 ++
> >>>  arch/arm64/include/asm/numa.h   |  47 +++++
> >>>  arch/arm64/kernel/setup.c       |   4 +
> >>>  arch/arm64/kernel/smp.c         |   2 +
> >>>  arch/arm64/mm/Makefile          |   1 +
> >>>  arch/arm64/mm/init.c            |  30 +++-
> >>>  arch/arm64/mm/numa.c            | 384 ++++++++++++++++++++++++++++++++++++++++
> >>>  8 files changed, 506 insertions(+), 4 deletions(-)
> >>>  create mode 100644 arch/arm64/include/asm/mmzone.h
> >>>  create mode 100644 arch/arm64/include/asm/numa.h
> >>>  create mode 100644 arch/arm64/mm/numa.c
> >>>
> >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >>> index 9ac16a4..7d8fb42 100644
> >>> --- a/arch/arm64/Kconfig
> >>> +++ b/arch/arm64/Kconfig
> >>> @@ -71,6 +71,7 @@ config ARM64
> >>>  	select HAVE_GENERIC_DMA_COHERENT
> >>>  	select HAVE_HW_BREAKPOINT if PERF_EVENTS
> >>>  	select HAVE_MEMBLOCK
> >>> +	select HAVE_MEMBLOCK_NODE_MAP if NUMA
> >>>  	select HAVE_PATA_PLATFORM
> >>>  	select HAVE_PERF_EVENTS
> >>>  	select HAVE_PERF_REGS
> >>> @@ -482,6 +483,30 @@ config HOTPLUG_CPU
> >>>  	  Say Y here to experiment with turning CPUs off and on.  CPUs
> >>>  	  can be controlled through /sys/devices/system/cpu.
> >>>
> >>> +# Common NUMA Features
> >>> +config NUMA
> >>> +	bool "Numa Memory Allocation and Scheduler Support"
> >>> +	depends on SMP
> >>> +	help
> >>> +	  Enable NUMA (Non Uniform Memory Access) support.
> >>> +
> >>> +	  The kernel will try to allocate memory used by a CPU on the
> >>> +	  local memory controller of the CPU and add some more
> >>> +	  NUMA awareness to the kernel.
> >>
> >> I appreciate that this is copied from x86, but what exactly do you mean
> >> by "memory controller" here?
> > ok, it is fair enough to say local memory.
> >>
> >>> diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
> >>> new file mode 100644
> >>> index 0000000..6ddd468
> >>> --- /dev/null
> >>> +++ b/arch/arm64/include/asm/mmzone.h
> >>> @@ -0,0 +1,17 @@
> >>> +#ifndef __ASM_ARM64_MMZONE_H_
> >>> +#define __ASM_ARM64_MMZONE_H_
> >>
> >> Please try to follow the standard naming for header guards under arm64
> >> (yes, it's not perfect, but we've made some effort for consistency).
> > sure, will follow as in other code.
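[For reference, the guard convention being asked for appears to be the
__ASM_<NAME>_H form used elsewhere under arch/arm64/include/asm; a minimal
sketch of the fixed header, assuming that naming (the exact name in the
merged patch may differ):

	/* Sketch only: guard name assumes the usual arm64 convention */
	#ifndef __ASM_MMZONE_H
	#define __ASM_MMZONE_H

	#ifdef CONFIG_NUMA

	#include <asm/numa.h>

	extern struct pglist_data *node_data[];
	#define NODE_DATA(nid)		(node_data[(nid)])

	#endif /* CONFIG_NUMA */
	#endif /* __ASM_MMZONE_H */
]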
> >>
> >>> +
> >>> +#ifdef CONFIG_NUMA
> >>> +
> >>> +#include <linux/mmdebug.h>
> >>> +#include <linux/types.h>
> >>> +
> >>> +#include <asm/smp.h>
> >>> +#include <asm/numa.h>
> >>> +
> >>> +extern struct pglist_data *node_data[];
> >>> +
> >>> +#define NODE_DATA(nid)		(node_data[(nid)])
> >>
> >> This is the same as m32r, metag, parisc, powerpc, s390, sh, sparc, tile
> >> and x86. Can we make this the default in the core code instead and then
> >> replace this header file with asm-generic or something?
> > IIUC, it is the same in most, but not in all, arches.

Yes, so we should make the core code take on the behaviour that most
architectures want, and then allow the few remaining architectures that
need to do something differently override those macros. We do this all
over the place in the kernel.

> >>
> >>> +
> >>> +#endif /* CONFIG_NUMA */
> >>> +#endif /* __ASM_ARM64_MMZONE_H_ */
> >>> diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
> >>> new file mode 100644
> >>> index 0000000..c00f3a4
> >>> --- /dev/null
> >>> +++ b/arch/arm64/include/asm/numa.h
> >>> @@ -0,0 +1,47 @@
> >>> +#ifndef _ASM_NUMA_H
> >>> +#define _ASM_NUMA_H
> >>
> >> Same comment on the guards.
> > ok
> >>
> >>> +#include <linux/nodemask.h>
> >>> +#include <asm/topology.h>
> >>> +
> >>> +#ifdef CONFIG_NUMA
> >>> +
> >>> +#define NR_NODE_MEMBLKS		(MAX_NUMNODES * 2)
> >>
> >> This is only used by the ACPI code afaict, so maybe include it when you
> >> add that?
> > ok
> >>
> >>> +#define ZONE_ALIGN (1UL << (MAX_ORDER + PAGE_SHIFT))
> >>
> >> Where is this used?
> > sorry, it was used in v6; i missed deleting it.
> >>
> >>> +
> >>> +/* currently, arm64 implements flat NUMA topology */
> >>> +#define parent_node(node)	(node)
> >>> +
> >>> +extern int __node_distance(int from, int to);
> >>> +#define node_distance(a, b)	__node_distance(a, b)
> >>> +
> >>> +/* dummy definitions for pci functions */
> >>> +#define pcibus_to_node(node)	0
> >>> +#define cpumask_of_pcibus(bus)	0
> >>
> >> There's a bunch of these dummy definitions already available in
> >> asm-generic/topology.h. Can we use those instead of rolling our own
> >> please?
> these are dummy in this patch, and i will post a separate patch for numa-pci.

So you're going to use the generic definitions?

> >> This is certainly more readable than the non-numa zone_sizes_init. Is
> >> there a reason we can't always select HAVE_MEMBLOCK_NODE_MAP and avoid
> >> having to handle the zone holes explicitly?
> > yes, i can think of selecting HAVE_MEMBLOCK_NODE_MAP always
> > instead of only for numa.
> > i can experiment with using the numa zone_sizes_init for the non-numa
> > case as well and deleting the current zone_sizes_init.
> if i enable HAVE_MEMBLOCK_NODE_MAP for the non-numa case, i see issues in
> sparse_init; it could be some dependency due to the explicit call of
> memory_present (not sure) in the non-numa path.
> IMO, this needs to be worked out as a separate patch.

Ok, please can you investigate and send a separate patch then?

> >>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> >>> +/*
> >>> + * Returns a pointer to the bitmask of CPUs on Node 'node'.
> >>> + */
> >>> +const struct cpumask *cpumask_of_node(int node)
> >>> +{
> >>> +
> >>> +	if (WARN_ON(node >= nr_node_ids))
> >>> +		return cpu_none_mask;
> >>> +
> >>> +	if (WARN_ON(node_to_cpumask_map[node] == NULL))
> >>> +		return cpu_online_mask;
> >>> +
> >>> +	return node_to_cpumask_map[node];
> >>> +}
> >>> +EXPORT_SYMBOL(cpumask_of_node);
> >>> +#endif
> >>> +
> >>> +static void map_cpu_to_node(unsigned int cpu, int nid)
> >>> +{
> >>> +	set_cpu_numa_node(cpu, nid);
> >>> +	if (nid >= 0)
> >>> +		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
> >>> +}
> >>> +
> >>> +static void unmap_cpu_to_node(unsigned int cpu)
> >>> +{
> >>> +	int nid = cpu_to_node(cpu);
> >>> +
> >>> +	if (nid >= 0)
> >>> +		cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
> >>> +	set_cpu_numa_node(cpu, NUMA_NO_NODE);
> >>> +}
> >>> +
> >>> +void numa_clear_node(unsigned int cpu)
> >>> +{
> >>> +	unmap_cpu_to_node(cpu);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Allocate node_to_cpumask_map based on number of available nodes
> >>> + * Requires node_possible_map to be valid.
> >>> + *
> >>> + * Note: cpumask_of_node() is not valid until after this is done.
> >>> + * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
> >>> + */
> >>> +static void __init setup_node_to_cpumask_map(void)
> >>> +{
> >>> +	unsigned int cpu;
> >>> +	int node;
> >>> +
> >>> +	/* setup nr_node_ids if not done yet */
> >>> +	if (nr_node_ids == MAX_NUMNODES)
> >>> +		setup_nr_node_ids();
> >>> +
> >>> +	/* allocate and clear the mapping */
> >>> +	for (node = 0; node < nr_node_ids; node++) {
> >>> +		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
> >>> +		cpumask_clear(node_to_cpumask_map[node]);
> >>> +	}
> >>> +
> >>> +	for_each_possible_cpu(cpu)
> >>> +		set_cpu_numa_node(cpu, NUMA_NO_NODE);
> >>> +
> >>> +	/* cpumask_of_node() will now work */
> >>> +	pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Set the cpu to node and mem mapping
> >>> + */
> >>> +void numa_store_cpu_info(unsigned int cpu)
> >>> +{
> >>> +	map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
> >>> +}
> >>> +
> >>> +/**
> >>> + * numa_add_memblk - Set node id to memblk
> >>> + * @nid: NUMA node ID of the new memblk
> >>> + * @start: Start address of the new memblk
> >>> + * @size: Size of the new memblk
> >>> + *
> >>> + * RETURNS:
> >>> + * 0 on success, -errno on failure.
> >>> + */
> >>> +int __init numa_add_memblk(int nid, u64 start, u64 size)
> >>> +{
> >>> +	int ret;
> >>> +
> >>> +	ret = memblock_set_node(start, size, &memblock.memory, nid);
> >>> +	if (ret < 0) {
> >>> +		pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on node %d\n",
> >>> +			start, (start + size - 1), nid);
> >>> +		return ret;
> >>> +	}
> >>> +
> >>> +	node_set(nid, numa_nodes_parsed);
> >>> +	pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
> >>> +		start, (start + size - 1), nid);
> >>> +	return ret;
> >>> +}
> >>> +EXPORT_SYMBOL(numa_add_memblk);
> >>> +
> >>> +/* Initialize NODE_DATA for a node on the local memory */
> >>> +static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
> >>> +{
> >>> +	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
> >>> +	u64 nd_pa;
> >>> +	void *nd;
> >>> +	int tnid;
> >>> +
> >>> +	pr_info("Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
> >>> +		nid, start_pfn << PAGE_SHIFT,
> >>> +		(end_pfn << PAGE_SHIFT) - 1);
> >>> +
> >>> +	nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> >>> +	nd = __va(nd_pa);
> >>> +
> >>> +	/* report and initialize */
> >>> +	pr_info("  NODE_DATA [mem %#010Lx-%#010Lx]\n",
> >>> +		nd_pa, nd_pa + nd_size - 1);
> >>> +	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> >>> +	if (tnid != nid)
> >>> +		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
> >>> +
> >>> +	node_data[nid] = nd;
> >>> +	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> >>> +	NODE_DATA(nid)->node_id = nid;
> >>> +	NODE_DATA(nid)->node_start_pfn = start_pfn;
> >>> +	NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
> >>> +}
> >>> +
> >>> +/**
> >>> + * numa_reset_distance - Reset NUMA distance table
> >>> + *
> >>> + * The current table is freed.
> >>> + * The next numa_set_distance() call will create a new one.
> >>> + */
> >>> +void __init numa_reset_distance(void)
> >>> +{
> >>> +	size_t size;
> >>> +
> >>> +	if (!numa_distance)
> >>> +		return;
> >>> +
> >>> +	size = numa_distance_cnt * numa_distance_cnt *
> >>> +		sizeof(numa_distance[0]);
> >>> +
> >>> +	memblock_free(__pa(numa_distance), size);
> >>> +	numa_distance_cnt = 0;
> >>> +	numa_distance = NULL;
> >>> +}
> >>> +
> >>> +static int __init numa_alloc_distance(void)
> >>> +{
> >>> +	size_t size;
> >>> +	u64 phys;
> >>> +	int i, j;
> >>> +
> >>> +	size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
> >>> +	phys = memblock_find_in_range(0, PFN_PHYS(max_pfn),
> >>> +					size, PAGE_SIZE);
> >>> +	if (WARN_ON(!phys))
> >>> +		return -ENOMEM;
> >>> +
> >>> +	memblock_reserve(phys, size);
> >>> +
> >>> +	numa_distance = __va(phys);
> >>> +	numa_distance_cnt = nr_node_ids;
> >>> +
> >>> +	/* fill with the default distances */
> >>> +	for (i = 0; i < numa_distance_cnt; i++)
> >>> +		for (j = 0; j < numa_distance_cnt; j++)
> >>> +			numa_distance[i * numa_distance_cnt + j] = i == j ?
> >>> +				LOCAL_DISTANCE : REMOTE_DISTANCE;
> >>> +
> >>> +	pr_debug("NUMA: Initialized distance table, cnt=%d\n",
> >>> +		numa_distance_cnt);
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +/**
> >>> + * numa_set_distance - Set NUMA distance from one NUMA to another
> >>> + * @from: the 'from' node to set distance
> >>> + * @to: the 'to' node to set distance
> >>> + * @distance: NUMA distance
> >>> + *
> >>> + * Set the distance from node @from to @to to @distance. If distance table
> >>> + * doesn't exist, one which is large enough to accommodate all the currently
> >>> + * known nodes will be created.
> >>> + *
> >>> + * If such table cannot be allocated, a warning is printed and further
> >>> + * calls are ignored until the distance table is reset with
> >>> + * numa_reset_distance().
> >>> + *
> >>> + * If @from or @to is higher than the highest known node or lower than zero
> >>> + * at the time of table creation or @distance doesn't make sense, the call
> >>> + * is ignored.
> >>> + * This is to allow simplification of specific NUMA config implementations.
> >>> + */
> >>> +void __init numa_set_distance(int from, int to, int distance)
> >>> +{
> >>> +	if (!numa_distance)
> >>> +		return;
> >>> +
> >>> +	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
> >>> +			from < 0 || to < 0) {
> >>> +		pr_warn_once("NUMA: Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
> >>> +			     from, to, distance);
> >>> +		return;
> >>> +	}
> >>> +
> >>> +	if ((u8)distance != distance ||
> >>> +	    (from == to && distance != LOCAL_DISTANCE)) {
> >>> +		pr_warn_once("NUMA: Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
> >>> +			     from, to, distance);
> >>> +		return;
> >>> +	}
> >>> +
> >>> +	numa_distance[from * numa_distance_cnt + to] = distance;
> >>> +}
> >>> +EXPORT_SYMBOL(numa_set_distance);
> >>> +
> >>> +int __node_distance(int from, int to)
> >>> +{
> >>> +	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
> >>> +		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
> >>> +	return numa_distance[from * numa_distance_cnt + to];
> >>> +}
> >>> +EXPORT_SYMBOL(__node_distance);
> >>
> >> Much of this is simply a direct copy/paste from x86. Why can't it be
> >> moved to common code? I don't see anything arch-specific here.
> > it is not the same for all arches.

See my earlier comment. Not having identical code for all architectures
doesn't mean it's not worth having a common portion of that in core
code. It makes it easier to maintain, easier to use and easier to
extend.

> >>> +/**
> >>> + * dummy_numa_init - Fallback dummy NUMA init
> >>> + *
> >>> + * Used if there's no underlying NUMA architecture, NUMA initialization
> >>> + * fails, or NUMA is disabled on the command line.
> >>> + *
> >>> + * Must online at least one node and add memory blocks that cover all
> >>> + * allowed memory. This function must not fail.
> >>
> >> Why can't it fail? It looks like the return value is ignored by numa_init.
> > this function adds all memblocks to node 0, so it is unlikely to fail.

Good thing it returns an int, then.

Will
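[A note on the distance-table scheme discussed above: the patch stores
distances in a flat N x N row-major array indexed as from * cnt + to,
defaulting to LOCAL_DISTANCE (10) on the diagonal and REMOTE_DISTANCE (20)
elsewhere, and the (u8) cast rejects any distance that does not fit in a
byte. A minimal standalone sketch of the same logic, as userspace C rather
than kernel code (malloc stands in for memblock):

	#include <stdio.h>
	#include <stdlib.h>

	#define LOCAL_DISTANCE  10	/* values mirror include/linux/topology.h */
	#define REMOTE_DISTANCE 20

	static int *numa_distance;
	static int numa_distance_cnt;

	/* Allocate the table and fill in default distances, as
	 * numa_alloc_distance() does. */
	static int alloc_distance(int nr_nodes)
	{
		int i, j;

		numa_distance = malloc(nr_nodes * nr_nodes * sizeof(*numa_distance));
		if (!numa_distance)
			return -1;
		numa_distance_cnt = nr_nodes;

		for (i = 0; i < nr_nodes; i++)
			for (j = 0; j < nr_nodes; j++)
				numa_distance[i * nr_nodes + j] = (i == j) ?
					LOCAL_DISTANCE : REMOTE_DISTANCE;
		return 0;
	}

	/* Same validation as numa_set_distance(): reject out-of-range
	 * node ids and distances that don't fit in 8 bits. */
	static void set_distance(int from, int to, int distance)
	{
		if (from < 0 || to < 0 ||
		    from >= numa_distance_cnt || to >= numa_distance_cnt)
			return;
		if ((unsigned char)distance != distance ||
		    (from == to && distance != LOCAL_DISTANCE))
			return;
		numa_distance[from * numa_distance_cnt + to] = distance;
	}

	/* Same fallback behaviour as __node_distance(). */
	static int node_distance(int from, int to)
	{
		if (from >= numa_distance_cnt || to >= numa_distance_cnt)
			return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
		return numa_distance[from * numa_distance_cnt + to];
	}

	int main(void)
	{
		if (alloc_distance(2))
			return 1;
		set_distance(0, 1, 40);	/* e.g. a slower cross-socket link */
		printf("0->0: %d, 0->1: %d, 1->0: %d\n",
		       node_distance(0, 0), node_distance(0, 1),
		       node_distance(1, 0));
		free(numa_distance);
		return 0;
	}

Note the table is not required to be symmetric: setting 0->1 above leaves
1->0 at its REMOTE_DISTANCE default unless set explicitly.]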