On 18-11-09 16:46:02, Alexander Duyck wrote:
> On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote:
> > On 18-11-09 15:14:35, Alexander Duyck wrote:
> > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > > This patchset is essentially a refactor of the page initialization
> > > > > logic that is meant to provide for better code reuse while providing
> > > > > a significant improvement in deferred page initialization
> > > > > performance.
> > > > >
> > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of
> > > > > persistent memory per node I have seen the following. In the case
> > > > > of regular memory initialization the deferred init time was
> > > > > decreased from 3.75s to 1.06s on average. For the persistent memory
> > > > > the initialization time dropped from 24.17s to 19.12s on average.
> > > > > This amounts to a 253% improvement for the deferred memory
> > > > > initialization performance, and a 26% improvement in the persistent
> > > > > memory initialization performance.
> > > >
> > > > Hi Alex,
> > > >
> > > > Please try to run your persistent memory init experiment with
> > > > Daniel's patches:
> > > >
> > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@xxxxxxxxxx/
> > >
> > > I've taken a quick look at it. It seems like a bit of a brute force
> > > way to try and speed things up. I would be worried about it
> > > potentially
> >
> > There is a limit to the max number of threads that ktask starts. The
> > memory throughput is *much* higher than what one CPU can max out in a
> > node, so there is no reason to leave the other CPUs sitting idle during
> > boot when they can help to initialize.
>
> Right, but right now that limit can still be pretty big when it is
> something like 25% of all the CPUs on a 288 CPU system.

It is still OK: that is about 9 threads per node. That machine has 1T of
memory, which means 8 nodes need to initialize 2G of memory each. With
46G/s throughput it should take 0.043s, which is almost 10 times faster
than the 0.325s Daniel sees, so there is still room to saturate the
memory throughput. Now, if the multi-threading efficiency is good, it
should take 1.261s / 9 threads = 0.14s.

> One issue is the way the code works: it ends up essentially blowing out
> the cache over and over again. Doing things in two passes made it really
> expensive as you took one cache miss to initialize it, and another to
> free it. I think getting rid of that is one of the biggest gains with
> my patch set.

I am not disputing that your patches make sense; all I am saying is that
ktasks improve the init time by an order of magnitude on machines with a
large amount of memory.

Pasha
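P.S. The back-of-envelope figures in this thread can be reproduced with a
quick sketch. The 288 CPUs, 25% ktask thread cap, 8 nodes, 46G/s
throughput, 2G per node, and 1.261s single-threaded time are the numbers
quoted above; the 46G/s per-node bandwidth in particular is an assumption
about that specific machine, not a measured constant:

```python
# Back-of-envelope check of the deferred-init estimates discussed above.

cpus = 288              # CPUs on the test machine
nodes = 8               # NUMA nodes
ktask_cap = 0.25        # ktask limits itself to 25% of the CPUs

threads_per_node = cpus * ktask_cap / nodes
print(threads_per_node)            # 9.0 threads per node

bandwidth_gb_s = 46.0   # assumed per-node memory bandwidth
mem_per_node_gb = 2.0   # deferred memory to initialize per node

# Bandwidth-limited lower bound for one node's init time.
ideal_time_s = mem_per_node_gb / bandwidth_gb_s
print(round(ideal_time_s, 3))      # 0.043 s

# Best case with perfect 9-way scaling of the 1.261 s serial time.
serial_time_s = 1.261
scaled_time_s = serial_time_s / threads_per_node
print(round(scaled_time_s, 2))     # 0.14 s
```

The gap between the 0.043s bandwidth bound and the 0.325s Daniel measured
is what leaves room for more threads to help.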