On Mon, Nov 12, 2018 at 10:15:46PM +0000, Elliott, Robert (Persistent Memory) wrote:
>
>
> > -----Original Message-----
> > From: Daniel Jordan <daniel.m.jordan@xxxxxxxxxx>
> > Sent: Monday, November 12, 2018 11:54 AM
> > To: Elliott, Robert (Persistent Memory) <elliott@xxxxxxx>
> > Cc: Daniel Jordan <daniel.m.jordan@xxxxxxxxxx>; linux-mm@xxxxxxxxx;
> > kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; aarcange@xxxxxxxxxx;
> > aaron.lu@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; alex.williamson@xxxxxxxxxx;
> > bsd@xxxxxxxxxx; darrick.wong@xxxxxxxxxx; dave.hansen@xxxxxxxxxxxxxxx;
> > jgg@xxxxxxxxxxxx; jwadams@xxxxxxxxxx; jiangshanlai@xxxxxxxxx;
> > mhocko@xxxxxxxxxx; mike.kravetz@xxxxxxxxxx; Pavel.Tatashin@xxxxxxxxxxxxx;
> > prasad.singamsetty@xxxxxxxxxx; rdunlap@xxxxxxxxxxxxx;
> > steven.sistare@xxxxxxxxxx; tim.c.chen@xxxxxxxxx; tj@xxxxxxxxxx;
> > vbabka@xxxxxxx
> > Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > initialization within each node
> >
> > On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> > Memory) wrote:
> > > > -----Original Message-----
> > > > From: linux-kernel-owner@xxxxxxxxxxxxxxx <linux-kernel-
> > > > owner@xxxxxxxxxxxxxxx> On Behalf Of Daniel Jordan
> > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > initialization within each node
> > > >
> > > ...
> > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > CPUs on the node.
> > > ...
> > > > +	/*
> > > > +	 * We'd like to know the memory bandwidth of the chip to
> > > > calculate the
> > > > +	 * most efficient number of threads to start, but we can't.
> > > > +	 * In testing, a good value for a variety of systems was a
> > > > quarter of the CPUs on the node.
> > > > +	 */
> > > > +	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > >
> > >
> > > You might want to base that calculation on and limit the threads to
> > > physical cores, not hyperthreaded cores.
> >
> > Why? Hyperthreads can be beneficial when waiting on memory. That said, I
> > don't have data that shows that in this case.
>
> I think that's only if there are some register-based calculations to do while
> waiting. If both threads are just doing memory accesses, they'll both stall, and
> there doesn't seem to be any benefit in having two contexts generate the IOs
> rather than one (at least on the systems I've used). I think it takes longer
> to switch contexts than to just turnaround the next IO.

(Sorry for the delay, Plumbers is over now...)

I guess we're both just waving our hands without data. I've only got x86, so
using a quarter of the CPUs rules out HT on my end. Do you have a system that
you can test this on, where using a quarter of the CPUs will involve HT?

Thanks,
Daniel
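
For reference, here is a minimal user-space sketch of the arithmetic under discussion. It is not code from the patch. DIV_ROUND_UP is redefined locally to mirror the kernel macro, and the function names and the threads_per_core parameter are hypothetical, introduced only to compare the quoted heuristic (a quarter of the logical CPUs on the node) with one possible reading of the suggestion to count physical cores instead.

```c
#include <assert.h>

/* Local stand-in for the kernel's DIV_ROUND_UP: integer division, rounded up. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Heuristic from the quoted patch: start a quarter of the logical CPUs
 * on the node, rounded up. logical_cpus plays the role of
 * cpumask_weight(cpumask) in the patch.
 */
static unsigned int threads_from_logical_cpus(unsigned int logical_cpus)
{
	return DIV_ROUND_UP(logical_cpus, 4);
}

/*
 * Hypothetical alternative: reduce the count to physical cores first,
 * then take a quarter. threads_per_core is an assumed input (for
 * example 2 with hyperthreading enabled); it is not something the
 * patch computes.
 */
static unsigned int threads_from_physical_cores(unsigned int logical_cpus,
						unsigned int threads_per_core)
{
	assert(threads_per_core > 0);
	return DIV_ROUND_UP(logical_cpus / threads_per_core, 4);
}
```

For a node with 48 logical CPUs and 2 threads per core (24 physical cores), the first function gives 12 and the second gives 6. In both cases the thread count is no larger than the number of physical cores, so the threads need not share a core; whether they actually do depends on scheduling, which this sketch does not model.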