On 2019/10/25 16:16, Michal Hocko wrote: > On Thu 24-10-19 12:40:13, Bjorn Helgaas wrote: >> On Thu, Oct 24, 2019 at 11:20:13AM +0200, Michal Hocko wrote: >>> On Wed 23-10-19 12:10:39, Bjorn Helgaas wrote: >>>> On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote: >>>>> On 2019/10/23 5:04, Bjorn Helgaas wrote: >>>>>> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote: >>>> >>>>>> I think the underlying problem you're addressing is that: >>>>>> >>>>>> - NUMA_NO_NODE == -1, >>>>>> - dev_to_node(dev) may return NUMA_NO_NODE, >>>>>> - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and >>>>>> - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference >>>>>> >>>>>> For example, on arm64, mips loongson, s390, and x86, >>>>>> cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is >>>>>> an invalid array index. >>>>> >>>>> The invalid array index of -1 is the underlying problem here when >>>>> cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node() >>>>> is not NUMA_NO_NODE aware yet. >>>>> >>>>> In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread >>>>> disscusion, it is requested that it is better to warn about the pcie >>>>> device without a node assigned by the firmware before making the >>>>> cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci >>>>> devices of "NUMA_NO_NODE" node can be fixed by their vendor. >>>>> >>>>> See: https://lore.kernel.org/lkml/20191011111539.GX2311@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ >>>> >>>> Right. We should warn if the NUMA node number would help us but DT or >>>> the firmware didn't give us one. >>>> >>>> But we can do that independently of any cpumask_of_node() changes. >>>> There's no need to do one patch before the other. Even if you make >>>> cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning >>>> because we're not actually changing any node assignments. >>> >>> Agreed. And this has been proposed previously I believe but my >>> understanding was that Petr was against that. >> >> I don't think Peter was opposed to a warning. > > Now, he was opposed to cpumask_of_node handling IIRC. That was my understanding too. But I am not sure if Peter is still opposed to cpumask_of_node() after the pcie device without node affinity is warned in this patch. >From previous discussion [1]: >Yunsheng> But I failed to see why the above is related to making node_to_cpumask_map() >Yunsheng> NUMA_NO_NODE aware? Peter> Your initial bug is for hns3, which is a PCI device, which really _MUST_ Peter> have a node assigned. Peter> It not having one, is a straight up bug. We must not silently accept Peter> NO_NODE there, ever." I think Peter is not opposed to cpumask_of_node() after this patch if I understand it correctly. [1] https://lore.kernel.org/lkml/20191011111539.GX2311@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > >> I assume you're >> referring to [1], which is about how cpumask_of_node() should work. >> That's not my area, so I don't have a strong opinion. From that >> discussion: >> >> Yunsheng> From what I can see, the problem can be fixed in three >> Yunsheng> place: >> Yunsheng> 1. Make user dev_to_node return a valid node id >> Yunsheng> even when proximity domain is not set by bios(or node id >> Yunsheng> set by buggy bios is not valid), which may need info from >> Yunsheng> the numa system to make sure it will return a valid node. >> Yunsheng> >> Yunsheng> 2. User that call cpumask_of_node should ensure the node >> Yunsheng> id is valid before calling cpumask_of_node, and user also >> Yunsheng> need some info to make ensure node id is valid. >> Yunsheng> >> Yunsheng> 3. Make sure cpumask_of_node deal with invalid node id as >> Yunsheng> this patchset. >> >> Peter> 1) because even it is not set, the device really does belong >> Peter> to a node. It is impossible a device will have magic >> Peter> uniform access to memory when CPUs cannot. >> Peter> >> Peter> 2) is already true today, cpumask_of_node() requires a valid >> Peter> node_id. >> Peter> >> Peter> 3) is just wrong and increases overhead for everyone. >> >> I think Peter is advocating (1), i.e., if firmware doesn't tell us a >> node ID, we just pick one. We can certainly do that in *addition* to >> adding a warning. I'd like it to be a separate patch because I think >> we want the warning no matter what so users have some clue that >> performance could be better. > > Yes, those should be two different patches. > >> If we pick one, I guess we either assign some default, like node 0, to >> everything; or we somehow pick a random node to assign. > > I have tried to explain that picking a default node is tricky because > node 0 is not generally available and you never know when a node might > just disappear if the device is not bound to it. Maybe the example you already mentioned in previous discussion may help here: 3e8589963773 ("memcg: make it work on sparse non-0-node systems") > > I really do not see why the proposed online_cpu_mask for that case is > such a big deal. It will likely lead to suboptimal performance but what > do you expect from a suboptimal HW description. There is no question > that the proper node affinity should be set and the warning might really > help here but trying to be clever and find a replacement sounds like > potential for more subtly broken results than doing a straightforward > thing. > > But I will just go silent now because I am just repeating myself. >