Re: [RFC PATCH 4/4] arm64:numa: adding numa support for arm64 platforms.

Ganapatrao Kulkarni <gpkulkarni@xxxxxxxxx> · Fri, 17 Oct 2014 22:49:56 +0530

Hi All,

On Mon, Oct 6, 2014 at 11:22 PM, Ganapatrao Kulkarni
<gpkulkarni@xxxxxxxxx> wrote:
> On Mon, Oct 6, 2014 at 4:56 PM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>> On Mon, Oct 06, 2014 at 06:14:36AM +0100, Ganapatrao Kulkarni wrote:
>>> Hi Mark,
>>>
>>> On Fri, Oct 3, 2014 at 5:43 PM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>>> > On Thu, Sep 25, 2014 at 10:03:59AM +0100, Ganapatrao Kulkarni wrote:
>>> >> Adding numa support for arm64 based platforms.
>>> >> This version creates numa mapping by parsing the dt table.
>>> >> cpu to node id mapping is derived from cluster_id as defined in cpu-map.
>>> >> memory to node id mapping is derived from nid property of memory node.
>>> >
>>> > [...]
>>> >
>>> >> +/*
>>> >> + * Too small node sizes may confuse the VM badly. Usually they
>>> >> + * result from BIOS bugs. So dont recognize nodes as standalone
>>> >> + * NUMA entities that have less than this amount of RAM listed:
>>> >> + */
>>> >> +#define NODE_MIN_SIZE (4*1024*1024)
>>> >
>>> > Why do these confuse the VM? what does BIOS have to do with arm64?
>>> sneaked in from x86, will remove this.
>>> >
>>> >> +
>>> >> +#define parent_node(node)      (node)
>>> >
>>> > Huh?
>>> for thunder, no hierarchy at numa nodes. shall i put under ifdef or
>>> separate header file?
>>
>> I was confused by a node being its own parent, but that seems to be the
>> case elsewhere for parent_node() implementations. Please at least have a
>> comment that we're assuming a flat hierarchy (for now).
> sure, will add the required comments.
>>
>>> >
>>> > [...]
>>> >
>>> >> @@ -168,6 +191,11 @@ void __init bootmem_init(void)
>>> >>         min = PFN_UP(memblock_start_of_DRAM());
>>> >>         max = PFN_DOWN(memblock_end_of_DRAM());
>>> >>
>>> >> +       high_memory = __va((max << PAGE_SHIFT) - 1) + 1;
>>> >> +       max_pfn = max_low_pfn = max;
>>> >> +
>>> >> +       if (IS_ENABLED(CONFIG_NUMA))
>>> >> +               arm64_numa_init();
>>> >
>>> > Is this function defined if !CONFIG_NUMA? Surely it must do nothing in
>>> > that case anyway?
>>> yes, if is not required, will remove it.
>>> >
>>> > [...]
>>> >
>>> >> +/*
>>> >> + *  Set the cpu to node and mem mapping
>>> >> + */
>>> >> +void numa_store_cpu_info(cpu)
>>> >> +{
>>> >> +       cpu_to_node_map[cpu] = cpu_topology[cpu].cluster_id;
>>> >> +       cpumask_set_cpu(cpu, node_to_cpumask_map[cpu_to_node_map[cpu]]);
>>> >> +       set_numa_node(cpu_to_node_map[cpu]);
>>> >> +       set_numa_mem(local_memory_node(cpu_to_node_map[cpu]));
>>> >> +}
>>> >
>>> > I don't like this. I think we need to be more explicit in the DT w.r.t.
>>> > the relationship between memory and the CPU hierarchy.
>>> >
>>> > I can imagine that we might end up with systems with multiple levels of
>>> > NUMA hierarchy (using MPIDR_EL1.Aff{3,2}), and I'd rather that we were
>>> > explcit as possible from the start w.r.t. the relationship between
>>> > memory and groups of CPUs such that we don't end up with multiple ways
>>> > of specifying said relationship, and horrible edge cases around implicit
>>> > definitions (e.g. the nid to cluster mapping).
>>> are you recomending to have explicit nid attribute to each cpu device node?
>>
>> I am recommending that we make the relationship explicit. If anything,
>> using the cpu-map (with phandles) seems like a better approach.
> yes, will add the mapping.
>>
>>> >> +/**
>>> >> + * dummy_numa_init - Fallback dummy NUMA init
>>> >> + *
>>> >> + * Used if there's no underlying NUMA architecture, NUMA initialization
>>> >> + * fails, or NUMA is disabled on the command line.
>>> >> + *
>>> >> + * Must online at least one node and add memory blocks that cover all
>>> >> + * allowed memory.  This function must not fail.
>>> >> + */
>>> >> +static int __init dummy_numa_init(void)
>>> >> +{
>>> >> +       pr_info("%s\n",
>>> >> +              numa_off ? "NUMA turned off" : "No NUMA configuration found");
>>> >
>>> > Why not print "NUMA turned off" in numa_setup?
>>> enters this function only when, numa is turned off or the DT/ACPI
>>> based numa init fails.
>>
>> Sure. Moving the "NUMA turned off" print into numa_setup would mean you
>> could just print "Using dummy NUMA layout" or something to that effect
>> here -- the function has no need to care about the value of numa_off.
> agreed.
>>
>>> >
>>> >> +       pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n",
>>> >> +              0LLU, PFN_PHYS(max_pfn) - 1);
>>> >> +
>>> >> +       node_set(0, numa_nodes_parsed);
>>> >> +       numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
>>> >> +
>>> >> +       return 0;
>>> >> +}
>>> >> +
>>> >> +/**
>>> >> + * early_init_dt_scan_numa_map - parse memory node and map nid to memory range.
>>> >> + */
>>> >> +int __init early_init_dt_scan_numa_map(unsigned long node, const char *uname,
>>> >> +                                    int depth, void *data)
>>> >> +{
>>> >> +       const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
>>> >> +       const __be32 *reg, *endp, *nid_prop;
>>> >> +       int l, nid;
>>> >> +
>>> >> +       /* We are scanning "memory" nodes only */
>>> >> +       if (type == NULL) {
>>> >> +               /*
>>> >> +                * The longtrail doesn't have a device_type on the
>>> >> +                * /memory node, so look for the node called /memory@0.
>>> >> +                */
>>> >> +               if (depth != 1 || strcmp(uname, "memory@0") != 0)
>>> >> +                       return 0;
>>> >
>>> > This has no place on arm64.
>>> i am not sure that we can move to driver/of, at this moment this is
>>> arm64 specific binding.
>>
>> I meant the longtrail workaround, hence my comments on that below.
> oh! thanks, will remove this.
>>
>>> > We limited to longtrail workaround in the core memory parsing to PPC32
>>> > only in commit b44aa25d20e2ef6b (of: Handle memory@0 node on PPC32
>>> > only). This code doesn't need it enabled ever.
>>> >
>>> > Are you booting using UEFI? This isn't going to work when the memory map
>>> tried with bootwrapper, working on to boot from UEFI.
>>> > comes from UEFI and we have no memory nodes in the DTB.
>>> yes, there is issue with UEFI boot, since memory node is removed.
>>> i request UEFI stub dev-team to suggest the possible ways to address this.
>>
>> I've Cc'd a few people who have worked on the stub and/or EFI memory map
>> stuff. It would be worth keeping them on Cc so as to keep them informed.
> thanks.
>>
>> I believe that the EFI stub is doing the right thing by ensuring that
>> the EFI memory map is used, so this is just another configuration that
>> your binding has to consider.
> going through EFI stub, next is to boot numa kernel using UEFI, will
> include UEFI boot support in v2 patch.
>>
>> Mark.
Below is the example for the proposal of numa bindings in DT.
This covers cpu to node mapping, memory ranges to node mapping.
Also defines proximity distance matrix of nodes to each other.
please let me know your comments to go ahead with the implementation.

numa-map{
         /*  Address cells used for memory range base address in mem-map.
             For all others, size-cells is used.
             Node-count tells the number of numa nodes in the system.
         */
         #address-cells = <2>;
         #size-cells = <1>;
         #node-count = <4>;

         /* Memmap for memory ranges on each node>

         mem-map = <0x0   0x00c00000 0>,
           <0x1   0x00000000 1>,
           <0x100 0x00000000 2>,
           <0x200 0x00000000 3>;

         /* CPU to node map for 4 NODE and 16 CPUs system
        < first-cpu last-cpu  node belongs>
         */
         cpu-map = <0 3 0>,
                         <4 7 1>,
                        <8 11 2>,
                        <12 16 3>;

         /*Proximity Distance matrix for 4Node system
                       <from-node to-node distance>
        */
         node-matrix=    <0 0 10>,
                               <0 1 20>,
                               <0 2 30>,
                               <0 3 10>,
                               <1 0 20>,
                                <1 1 10>,
                               <1 2 30>,
                              <1 3 10>,
                              <2 0 30>,
                              <2 1 20>,
                              <2 2 10>,
                             <2 3 10>,
                            <3 0 10>,
                            <3 1 20>,
                            <3 2 30>,
                            <3 3 10>,
 }

thanks
Ganapat
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html