Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()

Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> · Mon, 4 May 2020 21:26:01 -0400

On Mon, May 04, 2020 at 03:33:58PM -0700, Alexander Duyck wrote:
> On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan
> > @@ -1778,15 +1798,25 @@ static int __init deferred_init_memmap(void *data)
> >                 goto zone_empty;
> >
> >         /*
> > -        * Initialize and free pages in MAX_ORDER sized increments so
> > -        * that we can avoid introducing any issues with the buddy
> > -        * allocator.
> > +        * More CPUs always led to greater speedups on tested systems, up to
> > +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
> >          */
> 
> I would be curious about your data. That isn't what I have seen in the
> past. Typically only up to about 8 or 10 CPUs gives you any benefit,
> beyond that I was usually cache/memory bandwidth bound.

I was surprised too!  For most of its development, this set had an interface to
get the number of cores, on the theory that the core count was about where the
bandwidth got saturated, but the data showed otherwise.

There were diminishing returns, but they were more apparent on Haswell than
Skylake for instance.  I'll post some more data later in the thread where you
guys are talking about it.

> 
> > +       max_threads = max(cpumask_weight(cpumask), 1u);
> > +
> 
> We will need to gather data on if having a ton of threads works for
> all architectures.

Agreed.  I'll rope in some of the arch lists in the next version and include
the debugging knob to vary the thread count.

> For x86 I think we are freeing back pages in
> pageblock_order sized chunks so we only have to touch them once in
> initialize and then free the two pageblock_order chunks into the buddy
> allocator.
> 
> >         for_each_free_mem_pfn_range_in_zone_from(i, zone, &spfn, &epfn) {
> > -               while (spfn < epfn) {
> > -                       nr_pages += deferred_init_maxorder(zone, &spfn, epfn);
> > -                       cond_resched();
> > -               }
> > +               struct def_init_args args = { zone, ATOMIC_LONG_INIT(0) };
> > +               struct padata_mt_job job = {
> > +                       .thread_fn   = deferred_init_memmap_chunk,
> > +                       .fn_arg      = &args,
> > +                       .start       = spfn,
> > +                       .size        = epfn - spfn,
> > +                       .align       = MAX_ORDER_NR_PAGES,
> > +                       .min_chunk   = MAX_ORDER_NR_PAGES,
> > +                       .max_threads = max_threads,
> > +               };
> > +
> > +               padata_do_multithreaded(&job);
> > +               nr_pages += atomic_long_read(&args.nr_pages);
> >         }
> >  zone_empty:
> >         /* Sanity check that the next zone really is unpopulated */
> 
> Okay so looking at this I can see why you wanted to structure the
> other patch the way you did. However I am not sure that is the best
> way to go about doing it. It might make more sense to go through and
> accumulate sections. If you hit the end of a range and the start of
> the next range is in another section, then you split it as a new job,
> otherwise I would just accumulate it into the current job. You then
> could section align the work and be more or less guaranteed that each
> worker thread should be generating finished work products, and not
> incomplete max order pages.

This guarantee already holds with the max-order alignment passed to padata,
so I don't see what doing it on section boundaries would buy us on top of that.
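
To make the alignment argument concrete, here is a minimal userspace sketch of
how an aligned split can behave.  It is not the padata code: sketch_split(),
struct sketch_chunk and SKETCH_MAX_ORDER_NR_PAGES are hypothetical names, and
the value 1024 is only a stand-in for MAX_ORDER_NR_PAGES.  The assumption is
that the chunk size is rounded up to the alignment and that every interior
chunk boundary is rounded down to a multiple of it, so only the start of the
first chunk and the end of the last one can be unaligned.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for MAX_ORDER_NR_PAGES; the real value depends on the config. */
#define SKETCH_MAX_ORDER_NR_PAGES 1024UL

/* Hypothetical: one unit of work, covering pfns in [start, end). */
struct sketch_chunk {
	unsigned long start;
	unsigned long end;
};

static unsigned long sketch_round_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) / a * a;
}

static unsigned long sketch_round_down(unsigned long x, unsigned long a)
{
	return x / a * a;
}

/*
 * Split [start, start + size) into chunks whose interior boundaries are
 * multiples of align.  Writes at most out_len chunks and returns how many
 * were written (the range is truncated if out_len is too small).
 */
static size_t sketch_split(unsigned long start, unsigned long size,
			   unsigned long align, unsigned long min_chunk,
			   unsigned long max_threads,
			   struct sketch_chunk *out, size_t out_len)
{
	unsigned long end = start + size;
	unsigned long pos = start;
	unsigned long chunk;
	size_t n = 0;

	assert(align > 0 && max_threads > 0);

	/* Even share per thread, but never below min_chunk. */
	chunk = size / max_threads;
	if (chunk < min_chunk)
		chunk = min_chunk;
	/* A multiple of align that is at least align guarantees progress. */
	chunk = sketch_round_up(chunk, align);
	if (chunk == 0)
		chunk = align;

	while (pos < end && n < out_len) {
		/* Interior boundaries always land on a multiple of align. */
		unsigned long next = sketch_round_down(pos + chunk, align);

		if (next > end)
			next = end;
		out[n].start = pos;
		out[n].end = next;
		n++;
		pos = next;
	}
	return n;
}
```

With start = 1000, size = 5000, four threads, and both align and min_chunk set
to SKETCH_MAX_ORDER_NR_PAGES, the sketch yields [1000, 2048), [2048, 4096) and
[4096, 6000).  No max-order block is shared between two chunks, which is the
property section alignment would also be providing.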