On Mon, Sep 10, 2018 at 4:44 PM, Alexander Duyck <alexander.duyck@xxxxxxxxx> wrote:
> From: Alexander Duyck <alexander.h.duyck@xxxxxxxxx>
>
> This patch is based off of the pci_call_probe function used to initialize
> PCI devices. The general idea here is to move the probe call to a location
> that is local to the memory being initialized. By doing this we can shave
> significant time off of the total time needed for initialization.
>
> With this patch applied I see a significant reduction in overall init time
> as without it the init varied between 23 and 37 seconds to initialize a 3GB
> node. With this patch applied the variance is only between 23 and 26
> seconds to initialize each node.
>
> I hope to refine this further in the future by combining this logic into
> the async_schedule_domain code that is already in use. By doing that it
> would likely make this functionality redundant.

Yeah, it is a bit sad that we schedule an async thread only to move it
back somewhere else. Could we trivially achieve the same with an
async_schedule_domain_on_cpu() variant? It seems we can, and the
workqueue core will "Do the right thing".

I now notice that async uses the system_unbound_wq and work_on_cpu()
uses the system_wq. I don't think we want long-running nvdimm work on
system_wq.
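For illustration only, a variant along those lines might look roughly like the sketch below. To be clear, async_schedule_domain_on_cpu() does not exist in the tree; the sketch assumes the existing async machinery (struct async_entry, queue_work_on(), system_unbound_wq) and elides the allocation/cookie bookkeeping that __async_schedule() actually does:

```c
/*
 * HYPOTHETICAL sketch, not an existing kernel API. The idea is to
 * reuse the async infrastructure as-is, but queue the entry's work on
 * a specific CPU so the probe runs local to the memory being
 * initialized. Internals of the async core are assumed, not copied.
 */
static async_cookie_t async_schedule_domain_on_cpu(async_func_t func,
						   void *data, int cpu,
						   struct async_domain *domain)
{
	/*
	 * Where the real async core does, for the entry it allocates:
	 *
	 *	queue_work(system_unbound_wq, &entry->work);
	 *
	 * a CPU-affine variant would instead do:
	 *
	 *	queue_work_on(cpu, system_unbound_wq, &entry->work);
	 *
	 * On an unbound workqueue the cpu argument is a placement
	 * preference rather than a hard binding, and critically the
	 * work stays off system_wq, which matters for long-running
	 * nvdimm probe work.
	 */
	...
}
```

The point being that the caller keeps the async cookie/domain semantics (async_synchronize_full_domain() still works) while getting the node-local placement the patch open-codes today.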