On Fri, Jan 28, 2022 at 04:26:20PM +0100, David Hildenbrand wrote:
> For memory hot(un)plug, we only really care about memory blocks that:
> * span a single zone (and, thereby, a single node)
> * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
> If one of these conditions is not met, we reject memory offlining.
> Hotplugged memory blocks (starting out offline), always meet both
> conditions.

This has always been hard for me to follow, so bear with me.
I remember we changed the memory-hotplug policy not long ago wrt. what
we can online/offline, so we could get rid of certain assumptions like
"there are no holes in this memblock, so it can go" etc.

AFAIR, we can only offline memory that

 1) belongs to a single node (which is always the case for hotplugged
    memory; boot memory is trickier)
 2) does not have any holes
 3) spans a single zone

These are the only requirements we have atm, right?
By default, hotplugged memory already complies with all three; only in
the ZONE_DEVICE case might we violate 2) and 3).

> There are three scenarios to handle:

...

...

> @@ -225,6 +226,9 @@ static int memory_block_offline(struct memory_block *mem)
>  	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
>  	int ret;
>  
> +	if (!mem->zone)
> +		return -EBUSY;

Shouldn't we return -EINVAL? I mean, -EBUSY reads like a temporary
error that might get fixed later on, but that isn't the case here.

> @@ -234,7 +238,7 @@ static int memory_block_offline(struct memory_block *mem)
>  					  -nr_vmemmap_pages);
>  
>  	ret = offline_pages(start_pfn + nr_vmemmap_pages,
> -			    nr_pages - nr_vmemmap_pages, mem->group);
> +			    nr_pages - nr_vmemmap_pages, mem->zone, mem->group);

Why not pass the node as well?

> +static struct zone *early_node_zone_for_memory_block(struct memory_block *mem,
> +						      int nid)
> +{
> +	const unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> +	const unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +	struct zone *zone, *matching_zone = NULL;
> +	pg_data_t *pgdat = NODE_DATA(nid);

I was about to complain because in init_memory_block() you call
early_node_zone_for_memory_block() with nid == NUMA_NO_NODE, but then I
saw that NODE_DATA() on !CONFIG_NUMA falls back to contig_page_data.
So, I guess we cannot really reach this on CONFIG_NUMA machines with
nid being NUMA_NO_NODE, right? (Do we want to add a check just in
case?)

> +#ifdef CONFIG_NUMA
> +void memory_block_set_nid(struct memory_block *mem, int nid,
> +			  enum meminit_context context)

But we also set the zone here? (Only for boot memory.)

> @@ -663,6 +734,17 @@ static int init_memory_block(unsigned long block_id, unsigned long state,
>  	mem->nr_vmemmap_pages = nr_vmemmap_pages;
>  	INIT_LIST_HEAD(&mem->group_next);
>  
> +#ifndef CONFIG_NUMA
> +	if (state == MEM_ONLINE)
> +		/*
> +		 * MEM_ONLINE at this point imples early memory. With NUMA,

s/imples/implies/

-- 
Oscar Salvador
SUSE Labs
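P.S.: for my own understanding, this is how I read the single-zone
lookup, as a minimal sketch in the context of drivers/base/memory.c
(assuming zone_intersects() and populated_zone() from mmzone.h do the
matching, and reusing sections_per_block from that file; my reading,
not necessarily the exact patch code):

#include <linux/memory.h>
#include <linux/mmzone.h>

/* Sketch of the lookup as I understand it; not copied from the patch. */
static struct zone *early_node_zone_for_memory_block(struct memory_block *mem,
						     int nid)
{
	const unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
	const unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
	struct zone *zone, *matching_zone = NULL;
	pg_data_t *pgdat = NODE_DATA(nid);
	int i;

	/* Walk all zones of the node; accept exactly one intersecting zone. */
	for (i = 0; i < MAX_NR_ZONES; i++) {
		zone = pgdat->node_zones + i;
		if (!populated_zone(zone))
			continue;
		if (!zone_intersects(zone, start_pfn, nr_pages))
			continue;
		if (matching_zone)
			/* Block spans more than one zone: no early zone. */
			return NULL;
		matching_zone = zone;
	}
	return matching_zone;
}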