On 29.08.19 18:39, Alexander Duyck wrote:
> On Thu, Aug 29, 2019 at 12:00 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>> As far as I can see, the original split wanted to speed up a duplicate
>> initialization. We now only initialize once - and we really should
>> initialize under the lock, when resizing the zones.
>
> What do you mean by duplicate initialization? Basically the issue was
> that we can have systems with massive memory footprints and I was just
> trying to get the initialization time under a minute. The compromise I
> made was to split the initialization so that we only initialized the
> pages in the altmap and defer the rest so that they can be initialized
> in parallel.
>
> What this patch does is serialize the initialization, so it will likely
> take 2 to 4 minutes or more to initialize memory on a system where I
> had brought the init time under about 30 seconds.
>
>> As soon as we drop the lock we might have memory unplug running, trying
>> to shrink the zone again. In case the memmap was not properly
>> initialized, the zone/node shrinking code might get false negatives
>> when searching for the new zone/node boundaries - bad. We suddenly
>> could no longer span the memory we just added.
>
> The issue as I see it is that we can have systems with multiple nodes
> and each node can populate a fairly significant amount of data. By
> pushing this all back under the hotplug lock you are serializing the
> initialization for each node, resulting in a 4x or 8x increase in
> initialization time on some of these bigger systems.
>
>> Also, once we want to fix set_zone_contiguous(zone) inside
>> move_pfn_range_to_zone() to actually work with ZONE_DEVICE (instead of
>> always immediately stopping and never setting zone->contiguous), we
>> have to have the whole memmap initialized at that point. (Not sure if
>> we want that in the future, just a note.)
>>
>> Let's just keep things simple and initialize the memmap when resizing
>> the zones under the lock.
>>
>> If this is a real performance issue, we have to watch out for
>> alternatives.
>
> My concern is that this is going to become a performance issue, but I
> don't have access to the hardware at the moment to test how much of
> one. I'll have to check to see if I can get access to a system to test
> this patch set.
>

Thanks for having a look - I already dropped this patch again. We will
rather stop shrinking the ZONE_DEVICE zone. So assume this patch is
gone.

[...]

> So if you are moving this all under the lock then this is going to
> serialize initialization and it is going to be quite expensive on big
> systems. I was only ever able to get the init time down to something
> like ~20s with the optimized function. Since that has been torn apart
> and you are back to doing things with memmap_init_zone, we are
> probably looking at more like 25-30s for each node, and on a 4 node
> system we are looking at 2 minutes or so, which may lead to issues if
> people are mounting this.
>
> Instead of forcing this all to be done under the hotplug lock, is
> there some way we could do this under the zone span_seqlock instead to
> achieve the same result?
I guess the right approach really is, as Michal suggests, to not shrink
at all (at least ZONE_DEVICE) :)

>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c5d62f1c2851..44038665fe8e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5845,7 +5845,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>>   */
>>  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>  		unsigned long start_pfn, enum memmap_context context,
>> -		struct vmem_altmap *altmap)
>> +		struct dev_pagemap *pgmap)
>>  {
>>  	unsigned long pfn, end_pfn = start_pfn + size;
>>  	struct page *page;
>>
>> @@ -5853,24 +5853,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>  	if (highest_memmap_pfn < end_pfn - 1)
>>  		highest_memmap_pfn = end_pfn - 1;
>>
>> -#ifdef CONFIG_ZONE_DEVICE
>> -	/*
>> -	 * Honor reservation requested by the driver for this ZONE_DEVICE
>> -	 * memory. We limit the total number of pages to initialize to just
>> -	 * those that might contain the memory mapping. We will defer the
>> -	 * ZONE_DEVICE page initialization until after we have released
>> -	 * the hotplug lock.
>> -	 */
>> -	if (zone == ZONE_DEVICE) {
>> -		if (!altmap)
>> -			return;
>> -
>> -		if (start_pfn == altmap->base_pfn)
>> -			start_pfn += altmap->reserve;
>> -		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>> -	}
>> -#endif
>> -
>>  	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>  		/*
>>  		 * There can be holes in boot-time mem_map[]s handed to this
>> @@ -5892,6 +5874,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>  		if (context == MEMMAP_HOTPLUG)
>>  			__SetPageReserved(page);
>>
>> +#ifdef CONFIG_ZONE_DEVICE
>> +		if (zone == ZONE_DEVICE) {
>> +			WARN_ON_ONCE(!pgmap);
>> +			/*
>> +			 * ZONE_DEVICE pages union ->lru with a ->pgmap back
>> +			 * pointer and zone_device_data. It is a bug if a
>> +			 * ZONE_DEVICE page is ever freed or placed on a
>> +			 * driver-private list.
>> +			 */
>> +			page->pgmap = pgmap;
>> +			page->zone_device_data = NULL;
>> +		}
>> +#endif
>> +
>
> So I am not sure this is right. Part of the reason for the split is
> that the pages that were a part of the altmap had an LRU setup, not
> the pgmap/zone_device_data setup. This is changing that.

Yeah, you might be right, we would have to handle the altmap part
separately.

>
> Also, I am more a fan of just testing pgmap and, if it is present,
> assigning page->pgmap and resetting zone_device_data. Then you can
> avoid the test for zone on every iteration and the WARN_ON_ONCE check,
> or at least you could move it outside the loop so we don't incur the
> cost with each page. Are there situations where you would have pgmap
> but not a ZONE_DEVICE page?

Don't think so. But I am definitely not a ZONE_DEVICE expert.

>
>>  		/*
>>  		 * Mark the block movable so that blocks are reserved for
>>  		 * movable at startup. This will force kernel allocations
>> @@ -5951,14 +5947,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>  	 */
>>  	__SetPageReserved(page);
>>
>> -	/*
>> -	 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> -	 * and zone_device_data. It is a bug if a ZONE_DEVICE page is
>> -	 * ever freed or placed on a driver-private list.
>> -	 */
>> -	page->pgmap = pgmap;
>> -	page->zone_device_data = NULL;
>> -
>>  	/*
>>  	 * Mark the block movable so that blocks are reserved for
>>  	 * movable at startup. This will force kernel allocations
>
> Shouldn't you be removing this function instead of just gutting it?
> I'm kind of surprised you aren't getting warnings about unused code,
> since you also pulled the declaration for it from the header.
>

Don't ask me what went wrong here, I guess I messed this up while
rebasing.

-- 
Thanks,

David / dhildenb