On Thu, 8 Feb 2018 13:45:55 -0500 Pavel Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:

> Deferred page initialization allows the boot cpu to initialize a small
> subset of the system's pages early in boot, with other cpus doing the
> rest later on.
>
> It is, however, problematic to know how many pages the kernel needs
> during boot. Different modules and kernel parameters may change the
> requirement, so the boot cpu either initializes too many pages or runs
> out of memory.
>
> To fix that, initialize early pages on demand. This ensures the kernel
> does the minimum amount of work to initialize pages during boot and
> leaves the rest to be divided in the multithreaded initialization path
> (deferred_init_memmap).
>
> The on-demand code is permanently disabled using static branching once
> deferred pages are initialized. After the static branch is changed to
> false, the overhead is up to two branch-always instructions if the zone
> watermark check fails or if rmqueue fails.
>
> ...
>
> @@ -1604,6 +1566,96 @@ static int __init deferred_init_memmap(void *data)
>  	pgdat_init_report_one_done();
>  	return 0;
>  }
> +
> +/*
> + * Protects some early interrupt threads, and also for a short period of time
> + * from smp_init() to page_alloc_init_late() when deferred pages are
> + * initialized.
> + */
> +static __initdata DEFINE_SPINLOCK(deferred_zone_grow_lock);

Comment is a little confusing. Locks don't protect "threads" - they
protect data. Can we be specific about which data is being protected?

Why is a new lock needed here? Those data structures already have
designated locks, don't they?

If the lock protects "early interrupt threads" then it's surprising to
see it taken with spin_lock() and not spin_lock_irqsave()?
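That is, if this can race with an allocation attempt from interrupt
context during the smp_init()..page_alloc_init_late() window, I'd have
expected something like this (just a sketch, assuming the lock is
needed at all):

	unsigned long flags;

	spin_lock_irqsave(&deferred_zone_grow_lock, flags);
	if (!static_branch_unlikely(&deferred_pages)) {
		spin_unlock_irqrestore(&deferred_zone_grow_lock, flags);
		return false;
	}
	/* ... grow the zone ... */
	spin_unlock_irqrestore(&deferred_zone_grow_lock, flags);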
> +DEFINE_STATIC_KEY_TRUE(deferred_pages);
> +
> +/*
> + * If this zone has deferred pages, try to grow it by initializing enough
> + * deferred pages to satisfy the allocation specified by order, rounded up to
> + * the nearest PAGES_PER_SECTION boundary. So we're adding memory in increments
> + * of SECTION_SIZE bytes by initializing struct pages in increments of
> + * PAGES_PER_SECTION * sizeof(struct page) bytes.
> + */

Please also document the return value.

> +static noinline bool __init

Why was noinline needed?

> +deferred_grow_zone(struct zone *zone, unsigned int order)
> +{
> +	int zid = zone_idx(zone);
> +	int nid = zone->node;
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
> +	unsigned long nr_pages = 0;
> +	unsigned long first_init_pfn, first_deferred_pfn, spfn, epfn, t;
> +	phys_addr_t spa, epa;
> +	u64 i;
> +
> +	/* Only the last zone may have deferred pages */
> +	if (zone_end_pfn(zone) != pgdat_end_pfn(pgdat))
> +		return false;
> +
> +	first_deferred_pfn = READ_ONCE(pgdat->first_deferred_pfn);

It would be nice to have a little comment explaining why READ_ONCE was
needed.

Would it still be needed if this code was moved into the locked region?

> +	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> +
> +	if (first_init_pfn >= pgdat_end_pfn(pgdat))
> +		return false;
> +
> +	spin_lock(&deferred_zone_grow_lock);
> +	/*
> +	 * Bail if we raced with another thread that disabled on demand
> +	 * initialization.
> +	 */
> +	if (!static_branch_unlikely(&deferred_pages)) {
> +		spin_unlock(&deferred_zone_grow_lock);
> +		return false;
> +	}
> +
> +	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> +		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> +		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> +
> +		while (spfn < epfn && nr_pages < nr_pages_needed) {
> +			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
> +			first_deferred_pfn = min(t, epfn);
> +			nr_pages += deferred_init_pages(nid, zid, spfn,
> +							first_deferred_pfn);
> +			spfn = first_deferred_pfn;
> +		}
> +
> +		if (nr_pages >= nr_pages_needed)
> +			break;
> +	}
> +
> +	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> +		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> +		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
> +		deferred_free_pages(nid, zid, spfn, epfn);
> +
> +		if (first_deferred_pfn == epfn)
> +			break;
> +	}
> +	WRITE_ONCE(pgdat->first_deferred_pfn, first_deferred_pfn);
> +	spin_unlock(&deferred_zone_grow_lock);
> +
> +	return nr_pages >= nr_pages_needed;
> +}
> +
> +/*
> + * deferred_grow_zone() is __init, but it is called from
> + * get_page_from_freelist() during early boot until deferred_pages permanently
> + * disables this call. This is why, we have refdata wrapper to avoid warning,
> + * and ensure that the function body gets unloaded.

s/why,/why/
s/ensure/to ensure/

> + */
> +static bool __ref
> +_deferred_grow_zone(struct zone *zone, unsigned int order)
> +{
> +	return deferred_grow_zone(zone, order);
> +}
> +
>  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>
>  void __init page_alloc_init_late(void)
> @@ -1613,6 +1665,14 @@ void __init page_alloc_init_late(void)
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  	int nid;
>
> +	/*
> +	 * We are about to initialize the rest of deferred pages, permanently
> +	 * disable on-demand struct page initialization.
> +	 */
> +	spin_lock(&deferred_zone_grow_lock);
> +	static_branch_disable(&deferred_pages);
> +	spin_unlock(&deferred_zone_grow_lock);

Ah, so the new lock is to protect the static branch machinery only?

>  	/* There will be num_node_state(N_MEMORY) threads */
>  	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
>  	for_each_node_state(nid, N_MEMORY) {
>
> ...
>
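If so, spelling that out in the comment would help. My reading of the
intended pairing (my interpretation, not stated in the patch):

	deferred_grow_zone()			page_alloc_init_late()
	--------------------			----------------------
	spin_lock(&deferred_zone_grow_lock);
	if (!static_branch_unlikely(
			&deferred_pages))
		bail;				spin_lock(...); /* spins */
	/* init sections, update
	 * pgdat->first_deferred_pfn */
	spin_unlock(...);			/* lock acquired */
						static_branch_disable(&deferred_pages);
						spin_unlock(...);

i.e. once static_branch_disable() has returned, no grower can still be
mid-update on first_deferred_pfn when the deferred_init_memmap()
threads take over.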
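Also, for context: I'm assuming the caller side (not visible in these
hunks) is along these lines, hooked in where the zone watermark check
fails - please correct me if the hook is elsewhere:

	/* sketch - presumably in get_page_from_freelist() */
	if (static_branch_unlikely(&deferred_pages)) {
		if (_deferred_grow_zone(zone, order))
			goto try_this_zone;
	}

which would account for the "up to two branch-always instructions"
claim in the changelog: one patched-out site per failure path.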