Re: [PATCH v2 1/2] hugetlb: Fix hugepages_setup when deal with pernode

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Mon, 4 Apr 2022 16:48:35 -0700

On 4/4/22 03:41, David Hildenbrand wrote:
> On 01.04.22 19:23, Mike Kravetz wrote:
>> On 4/1/22 03:43, David Hildenbrand wrote:
>>> On 01.04.22 12:12, Peng Liu wrote:
>>>> Hugepages can be specified to pernode since "hugetlbfs: extend
>>>> the definition of hugepages parameter to support node allocation",
>>>> but the following problem is observed.
>>>>
>>>> Confusing behavior is observed when both 1G and 2M hugepage is set
>>>> after "numa=off".
>>>>  cmdline hugepage settings:
>>>>   hugepagesz=1G hugepages=0:3,1:3
>>>>   hugepagesz=2M hugepages=0:1024,1:1024
>>>>  results:
>>>>   HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
>>>>   HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages
>>>>
>>>> Furthermore, confusing behavior can be also observed when invalid
>>>> node behind valid node.
>>>>
>>>> To fix this, hugetlb_hstate_alloc_pages should be called even when
>>>> hugepages_setup going to invalid.
>>>
>>> Shouldn't we bail out if someone requests node-specific allocations but
>>> we are not running with NUMA?
>>
>> I thought about this as well, and could not come up with a good answer.
>> Certainly, nobody SHOULD specify both 'numa=off' and ask for node specific
>> allocations on the same command line.  I would have no problem bailing out
>> in such situations.  But, I think that would also require the hugetlb command
>> line processing to look for such situations.
> 
> Yes. Right now I see
> 
> if (tmp >= nr_online_nodes)
> 	goto invalid;
> 
> Which seems a little strange, because IIUC, it's the number of online
> nodes, which is completely wrong with a sparse online bitmap. Just
> imagine node 0 and node 2 are online, and node 1 is offline. Assuming
> that "node < 2" is valid is wrong.
> 
> Why don't we check for node_online() and bail out if that is not the
> case? Is it too early for that check? But why does comparing against
> nr_online_nodes() work, then?
> 
> 
> Having that said, I'm not sure if all usage of nr_online_nodes in
> mm/hugetlb.c is wrong, with a sparse online bitmap. Outside of that,
> it's really just used for "nr_online_nodes > 1". I might be wrong, though.

I think you are correct.  My bad for not being more thorough in reviewing
the original patch that added this code.  My incorrect assumption was that
a sparse node map was only possible via offline operations which could not
happen this early in boot.  I now see that a sparse map can be presented
by fw/bios/etc.  So, yes I do believe we need to check for online nodes.

-- 
Mike Kravetz

> 
>>
>> One could also argue that if there is only a single node (not numa=off on
>> command line) and someone specifies node local allocations we should bail.
> 
> I assume "numa=off" is always parsed before hugepages_setup() is called,
> right? So we can just rely on the actual numa information.
> 
> 
>>
>> I was 'thinking' about a situation where we had multiple nodes and node
>> local allocations were 'hard coded' via grub or something.  Then, for some
>> reason one node fails to come up on a reboot.  Should we bail on all the
>> hugetlb allocations, or should we try to allocate on the still available
>> nodes?
> 
> Depends on what "bail" means. Printing a warning and stopping to
> allocate further is certainly good enough for my taste :)
> 
>>
>> When I went back and reread the reason for this change, I see that it is
>> primarily for 'some debugging and test cases'.
>>
>>>
>>> What's the result after your change?
>>>
>>>>
>>>> Cc: <stable@xxxxxxxxxxxxxxx>
>>>
>>> I am not sure if this is really stable material.
>>
>> Right now, we partially and inconsistently process node specific allocations
>> if there are missing nodes.  We allocate 'regular' hugetlb pages on existing
>> nodes.  But, we do not allocate gigantic hugetlb pages on existing nodes.
>>
>> I believe this is worth fixing in stable.
> 
> I am skeptical.
> 
> https://www.kernel.org/doc/Documentation/process/stable-kernel-rules.rst
> 
> " - It must fix a real bug that bothers people (not a, "This could be a
>    problem..." type thing)."
> 
> While the current behavior is suboptimal, it's certainly not an urgent
> bug (?) and the kernel will boot and work just fine. As you mentioned
> "nobody SHOULD specify both 'numa=off' and ask for node specific
> allocations on the same command line.", this is just a corner case.
> 
> Adjusting it upstream -- okay. Backporting to stable? I don't think so.
>