Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

Nishanth Aravamudan <nacc@xxxxxxxxxxxxxxxxxx> · Fri, 10 Apr 2015 12:48:39 -0700

On 10.04.2015 [14:37:19 +0300], Konstantin Khlebnikov wrote:
> On 10.04.2015 01:58, Nishanth Aravamudan wrote:
> >On 09.04.2015 [07:27:28 +0300], Konstantin Khlebnikov wrote:
> >>On Thu, Apr 9, 2015 at 2:07 AM, Nishanth Aravamudan
> >><nacc@xxxxxxxxxxxxxxxxxx> wrote:
> >>>On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
> >>>>On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
> >>>>>Node 0 might be offline as well as any other numa node,
> >>>>>in this case kernel cannot handle memory allocation and crashes.
> >>>
> >>>Isn't the bug that numa_node_id() returned an offline node? That
> >>>shouldn't happen.
> >>
> >>Offline node 0 came from static-inline copy of that function from of.h
> >>I've patched weak function for keeping consistency.
> >
> >Got it, that's not necessarily clear in the original commit message.
> 
> Sorry.
> 
> >
> >>>#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> >>>...
> >>>#ifndef numa_node_id
> >>>/* Returns the number of the current Node. */
> >>>static inline int numa_node_id(void)
> >>>{
> >>>         return raw_cpu_read(numa_node);
> >>>}
> >>>#endif
> >>>...
> >>>#else   /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */
> >>>
> >>>/* Returns the number of the current Node. */
> >>>#ifndef numa_node_id
> >>>static inline int numa_node_id(void)
> >>>{
> >>>         return cpu_to_node(raw_smp_processor_id());
> >>>}
> >>>#endif
> >>>...
> >>>
> >>>So that's either the per-cpu numa_node value, right? Or the result of
> >>>cpu_to_node on the current processor.
> >>>
> >>>>Example:
> >>>>
> >>>>[    0.027133] ------------[ cut here ]------------
> >>>>[    0.027938] kernel BUG at include/linux/gfp.h:322!
> >>>
> >>>This is
> >>>
> >>>VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
> >>>
> >>>in
> >>>
> >>>alloc_pages_exact_node().
> >>>
> >>>And based on the trace below, that's
> >>>
> >>>__slab_alloc -> alloc
> >>>
> >>>alloc_pages_exact_node
> >>>         <- alloc_slab_page
> >>>                 <- allocate_slab
> >>>                         <- new_slab
> >>>                                 <- new_slab_objects
> >>>                                         <- __slab_alloc?
> >>>
> >>>which is just passing the node value down, right? Which I think was
> >>>from:
> >>>
> >>>         domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
> >>>                               GFP_KERNEL, of_node_to_nid(of_node));
> >>>
> >>>?
> >>>
> >>>
> >>>What platform is this on, looks to be x86? qemu emulation of a
> >>>pathological topology? What was the topology?
> >>
> >>qemu x86_64, 2 cpu, 2 numa nodes, all memory in second.
> >
> >Ok, this worked before? That is, this is a regression?
> 
> Seems like that worked before 3.17, where the
> bug was exposed by commit 44767bfaaed782d6d635ecbb13f3980041e6f33e
> (x86, irq: Enhance mp_register_ioapic() to support irqdomain);
> this is the first usage of *irq_domain_add*() in x86.

Ok.

> >>  I've slightly patched it to allow that setup (in qemu hardcoded 1Mb
> >>of memory connected to node 0). And I've found an unrelated bug --
> >>if a numa node has less than 4Mb of ram then the kernel crashes even
> >>earlier because the numa code ignores that node
> >>but the buddy allocator still tries to use those pages.
> >
> >So this isn't an actually supported topology by qemu?
> 
> Qemu easily creates memoryless numa nodes but node 0 has a hardcoded
> 1Mb of ram. This seems like a legacy prop for DOS era software.

Well, the problem is that x86 doesn't support memoryless nodes.

git grep MEMORYLESS_NODES
arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES
arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES

> >>>Note that there is a ton of code that seems to assume node 0 is online.
> >>>I started working on removing this assumption myself and it just led
> >>>down a rathole (on power, we always have node 0 online, even if it is
> >>>memoryless and cpuless, as a result).
> >>>
> >>>I am guessing this is just happening early in boot before the per-cpu
> >>>areas are setup? That's why (I think) x86 has the early_cpu_to_node()
> >>>function...
> >>>
> >>>Or do you not have CONFIG_OF set? So isn't the only change necessary to
> >>>the include file, and it should just return first_online_node rather
> >>>than 0?
> >>>
> >>>Ah and there's more of those node 0 assumptions :)
> >>
> >>That was x86, where there is no CONFIG_OF at all.
> >>
> >>I don't know what's wrong with that machine but ACPI reports
> >>cpus and memory from node 0 as connected to node 1 and everything
> >>seemed to work fine until the latest upgrade -- seems like the buggy
> >>static-inline of_node_to_nid was introduced in 3.13 but x86 ioapic uses
> >>it during early allocations only since 3.17. Machine owner tells that
> >>3.15 worked fine.
> >
> >So, this was a qemu emulation of this actual physical machine without a
> >node 0?
> 
> Yep. Also I have a crash from the real machine but that stacktrace is
> messy because CONFIG_DEBUG_VM wasn't enabled and the kernel crashed inside
> the buddy allocator when it tried to touch an unallocated numa node structure.
> 
> >
> >As I mentioned, there are lots of node 0 assumptions through the kernel.
> >You might run into more issues at runtime.
> 
> I think it's possible to trigger a kernel crash for any memoryless numa
> node (not just for 0) if some device (like ioapic in my case) points to
> it in its acpi tables. At runtime, numa affinity configured by the user
> is usually validated by the kernel, while numbers from firmware might
> be used without proper validation.
> 
> Anyway, seems like at least one x86 machine works fine without
> memory in node 0.

You're going to run into more issues, without adding proper memoryless
node support, I think.
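
To make the failure mode discussed above concrete, here is a minimal
userspace sketch. It is not kernel code: every model_* name is
hypothetical, the online map just mirrors the qemu topology from this
thread (node 0 offline, node 1 online), and the "pick first online
node" policy is an assumption for illustration, not what the slab
allocator actually does with NUMA_NO_NODE.

```c
#include <stdbool.h>

#define MAX_NUMNODES 4
#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for the kernel's online-node map:
 * node 0 offline, node 1 online, as in the qemu setup in this thread. */
static const bool model_node_online[MAX_NUMNODES] = {
	false, true, false, false
};

/* Models the old static-inline fallback: always reports node 0. */
int model_of_node_to_nid_old(void)
{
	return 0;
}

/* Models the patched fallback: reports "no node preference". */
int model_of_node_to_nid_new(void)
{
	return NUMA_NO_NODE;
}

/* Same condition as the VM_BUG_ON quoted above:
 * true when passing nid to an exact-node allocation would trip it. */
bool model_exact_node_would_bug(int nid)
{
	return nid < 0 || nid >= MAX_NUMNODES || !model_node_online[nid];
}

/* Lowest-numbered online node in the model, or NUMA_NO_NODE if none. */
int model_first_online_node(void)
{
	for (int n = 0; n < MAX_NUMNODES; n++) {
		if (model_node_online[n])
			return n;
	}
	return NUMA_NO_NODE;
}

/* Assumed policy for illustration only: treat NUMA_NO_NODE, out-of-range
 * and offline node ids (e.g. unvalidated firmware values) as "no
 * preference" and fall back to the first online node. */
int model_pick_alloc_node(int nid)
{
	if (model_exact_node_would_bug(nid))
		return model_first_online_node();
	return nid;
}
```

In this model the old fallback hands node 0 straight to the exact-node
check and trips it, while the patched fallback returns NUMA_NO_NODE and
leaves the choice to the caller. An offline node id coming from firmware
would still need the kind of validation sketched in
model_pick_alloc_node().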

-Nish
