On 4/4/22 03:41, David Hildenbrand wrote: > On 01.04.22 19:23, Mike Kravetz wrote: >> On 4/1/22 03:43, David Hildenbrand wrote: >>> On 01.04.22 12:12, Peng Liu wrote: >>>> Hugepages can be specified to pernode since "hugetlbfs: extend >>>> the definition of hugepages parameter to support node allocation", >>>> but the following problem is observed. >>>> >>>> Confusing behavior is observed when both 1G and 2M hugepage is set >>>> after "numa=off". >>>> cmdline hugepage settings: >>>> hugepagesz=1G hugepages=0:3,1:3 >>>> hugepagesz=2M hugepages=0:1024,1:1024 >>>> results: >>>> HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages >>>> HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages >>>> >>>> Furthermore, confusing behavior can be also observed when invalid >>>> node behind valid node. >>>> >>>> To fix this, hugetlb_hstate_alloc_pages should be called even when >>>> hugepages_setup going to invalid. >>> >>> Shouldn't we bail out if someone requests node-specific allocations but >>> we are not running with NUMA? >> >> I thought about this as well, and could not come up with a good answer. >> Certainly, nobody SHOULD specify both 'numa=off' and ask for node specific >> allocations on the same command line. I would have no problem bailing out >> in such situations. But, I think that would also require the hugetlb command >> line processing to look for such situations. > > Yes. Right now I see > > if (tmp >= nr_online_nodes) > goto invalid; > > Which seems a little strange, because IIUC, it's the number of online > nodes, which is completely wrong with a sparse online bitmap. Just > imagine node 0 and node 2 are online, and node 1 is offline. Assuming > that "node < 2" is valid is wrong. > > Why don't we check for node_online() and bail out if that is not the > case? Is it too early for that check? But why does comparing against > nr_online_nodes() work, then? > > > Having that said, I'm not sure if all usage of nr_online_nodes in > mm/hugetlb.c is wrong, with a sparse online bitmap. Outside of that, > it's really just used for "nr_online_nodes > 1". I might be wrong, though. I think you are correct. My bad for not being more thorough in reviewing the original patch that added this code. My incorrect assumption was that a sparse node map was only possible via offline operations which could not happen this early in boot. I now see that a sparse map can be presented by fw/bios/etc. So, yes I do believe we need to check for online nodes. -- Mike Kravetz > >> >> One could also argue that if there is only a single node (not numa=off on >> command line) and someone specifies node local allocations we should bail. > > I assume "numa=off" is always parsed before hugepages_setup() is called, > right? So we can just rely on the actual numa information. > > >> >> I was 'thinking' about a situation where we had multiple nodes and node >> local allocations were 'hard coded' via grub or something. Then, for some >> reason one node fails to come up on a reboot. Should we bail on all the >> hugetlb allocations, or should we try to allocate on the still available >> nodes? > > Depends on what "bail" means. Printing a warning and stopping to > allocate further is certainly good enough for my taste :) > >> >> When I went back and reread the reason for this change, I see that it is >> primarily for 'some debugging and test cases'. >> >>> >>> What's the result after your change? >>> >>>> >>>> Cc: <stable@xxxxxxxxxxxxxxx> >>> >>> I am not sure if this is really stable material. >> >> Right now, we partially and inconsistently process node specific allocations >> if there are missing nodes. We allocate 'regular' hugetlb pages on existing >> nodes. But, we do not allocate gigantic hugetlb pages on existing nodes. >> >> I believe this is worth fixing in stable. > > I am skeptical. > > https://www.kernel.org/doc/Documentation/process/stable-kernel-rules.rst > > " - It must fix a real bug that bothers people (not a, "This could be a > problem..." type thing)." > > While the current behavior is suboptimal, it's certainly not an urgent > bug (?) and the kernel will boot and work just fine. As you mentioned > "nobody SHOULD specify both 'numa=off' and ask for node specific > allocations on the same command line.", this is just a corner case. > > Adjusting it upstream -- okay. Backporting to stable? I don't think so. >