Re: [PATCH 2/2] mm: keep nid around during hot-remove

David Hildenbrand <david@xxxxxxxxxx> · Wed, 7 Aug 2024 17:23:48 +0200

On 07.08.24 16:40, Pasha Tatashin wrote:
On Wed, Aug 7, 2024 at 7:50 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 07.08.24 13:32, David Hildenbrand wrote:
On 07.08.24 00:14, Pasha Tatashin wrote:
nid is needed during memory hot-remove in order to account for the
memmap overhead that is being removed.

In addition, we cannot use page_pgdat(pfn_to_page(pfn)) during
hotremove after remove_pfn_range_from_zone().

We also cannot determine nid from walking through memblocks after
remove_memory_block_devices() is called.

Therefore, pass nid down from the beginning of hotremove to where
it is used for the accounting purposes.

I was happy to finally remove that nid parameter for good in:

commit 65a2aa5f482ed0c1b5afb9e6b0b9e0b16bb8b616
Author: David Hildenbrand <david@xxxxxxxxxx>
Date:   Tue Sep 7 19:55:04 2021 -0700

       mm/memory_hotplug: remove nid parameter from arch_remove_memory()

To ask the real question: Do we really need this counter per-nid at all?

Seems to over-complicate things.

Case in point: I think the handling is wrong?

Just because some memory belongs to a nid doesn't mean that the vmemmap
was allocated from that nid?

I believe that on hot-add we use the nid of the memory being added to
account the vmemmap, and on hot-remove we likewise use the nid of the
memory being removed. But you are correct: this does not guarantee that
the actual vmemmap memory is allocated from, or freed back to, the
given nid.

Right. For boot memory that we might want to unplug later it might be
different. I recall that with "movable_node", we might end up allocating
the vmemmap from remote nodes, such that all memory of a node stays
movable. That's why __earlyonly_bootmem_alloc() ends up calling
memblock_alloc_try_nid_raw(), to fall back to other nodes if required.
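
To make the mismatch concrete, here is a toy userspace model, not
kernel code. The counter array and the helpers toy_alloc_nid(),
account_add() and account_remove() are hypothetical names made up for
this sketch. It assumes the add side accounts against the node that
actually provided the vmemmap pages, while the remove side only knows
the nid of the memory being removed.

```c
#include <assert.h>

#define NR_NODES 2

/* Hypothetical per-node memmap page counters (illustration only). */
static long memmap_pages[NR_NODES];

/*
 * Toy stand-in for an allocation with node fallback: prefer
 * preferred_nid, otherwise take the first node that is allowed to
 * serve the allocation. Returns -1 if no node can.
 */
static int toy_alloc_nid(int preferred_nid, const int can_alloc[NR_NODES])
{
	assert(preferred_nid >= 0 && preferred_nid < NR_NODES);

	if (can_alloc[preferred_nid])
		return preferred_nid;

	for (int nid = 0; nid < NR_NODES; nid++)
		if (can_alloc[nid])
			return nid;

	return -1;
}

/* Assumption: accounted against the node that provided the pages. */
static void account_add(int vmemmap_nid, long nr_pages)
{
	assert(vmemmap_nid >= 0 && vmemmap_nid < NR_NODES);
	memmap_pages[vmemmap_nid] += nr_pages;
}

/* Assumption: only the nid of the removed memory is known here. */
static void account_remove(int memory_nid, long nr_pages)
{
	assert(memory_nid >= 0 && memory_nid < NR_NODES);
	memmap_pages[memory_nid] -= nr_pages;
}
```

If node 1 is kept fully movable (can_alloc = {1, 0}), the vmemmap for
node 1 memory lands on node 0. Adding 512 pages and then removing them
by memory nid leaves node 0 at 512 and node 1 at -512 in this model,
although nothing is left allocated.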

Wouldn't we want to look at the actual nid the vmemmap page belongs to
that we are removing?

I am now looking into converting this counter to be system-wide, i.e.
a vm_event. It is all done under the hotplug lock, so there is no
contention.

That would be easiest, assuming per-node information is not strictly 
required for now.
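
As a rough sketch of why a system-wide counter sidesteps the problem
(again a userspace toy, not the vm_event code; nr_memmap_pages_total,
memmap_account() and memmap_total() are hypothetical names): with a
single counter, add and remove cancel out no matter which node the
vmemmap pages came from. The sketch assumes every caller already holds
one global lock, so plain arithmetic is enough.

```c
#include <assert.h>

/* Hypothetical system-wide count of memmap pages (illustration only). */
static long nr_memmap_pages_total;

/*
 * Apply a signed page delta. Assumption: callers are serialized by a
 * single global lock, so no atomics or per-node state are needed.
 */
static void memmap_account(long delta)
{
	nr_memmap_pages_total += delta;
	/* A negative total would indicate unbalanced accounting. */
	assert(nr_memmap_pages_total >= 0);
}

/* Read back the current total. */
static long memmap_total(void)
{
	return nr_memmap_pages_total;
}
```

The price is exactly the one mentioned above: the per-node breakdown
is gone.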

--
Cheers,

David / dhildenb