On Mon, May 4, 2020 at 4:44 PM Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote:
>
> On May 4, 2020 3:33:58 PM PDT, Alexander Duyck <alexander.duyck@xxxxxxxxx> wrote:
> >On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan
> ><daniel.m.jordan@xxxxxxxxxx> wrote:
> >>	/*
> >> -	 * Initialize and free pages in MAX_ORDER sized increments so
> >> -	 * that we can avoid introducing any issues with the buddy
> >> -	 * allocator.
> >> +	 * More CPUs always led to greater speedups on tested systems, up to
> >> +	 * all the nodes' CPUs.  Use all since the system is otherwise idle now.
> >>	 */
> >
> >I would be curious about your data. That isn't what I have seen in the
> >past. Typically only up to about 8 or 10 CPUs gives you any benefit,
> >beyond that I was usually cache/memory bandwidth bound.
>
> I've found pretty much linear performance up to memory bandwidth, and on
> the systems I was testing, I didn't saturate memory bandwidth until about
> the full number of physical cores. From number of cores up to number of
> threads, the performance stayed about flat; it didn't get any better or
> worse.

That doesn't sound right, though, based on the numbers you provided. The
system you had was 192GB spread over 2 nodes with 48 threads / 24 cores per
node, correct? Your numbers went from ~290ms to ~28ms, a 10x decrease, and
that doesn't sound linear when you had to spread the work over 24 cores to
get there. I agree that the numbers largely stay flat once you hit the
peak; I saw similar behavior when I was working on the deferred init code
previously.

One concern I have, though, is that we may end up seeing better performance
with a subset of cores instead of running all of the cores/threads,
especially if features such as turbo come into play. In addition, we are
talking x86 only so far. I would be interested in seeing whether or not
this has benefits for other architectures.

Also, what is the penalty that is being paid in order to break up the work
beforehand and set it up for the parallel work?
I would be interested in seeing what the cost is on a system with fewer
cores per node, maybe even down to 1. That would tell us how much
additional overhead is being added to set things up to run in parallel.

If I get a chance tomorrow I might try applying the patches and doing some
testing myself.

Thanks.

- Alex