(Changed the subject, to make it more apparent what we are talking about.)

* Mike Travis <travis@xxxxxxx> wrote:

> On 6/25/2013 11:43 AM, H. Peter Anvin wrote:
> > On 06/25/2013 10:22 AM, Mike Travis wrote:
> >>
> >> On 6/25/2013 12:38 AM, Ingo Molnar wrote:
> >>>
> >>> * Nathan Zimmer <nzimmer@xxxxxxx> wrote:
> >>>
> >>>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote:
> >>>>>
> >>>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and
> >>>>> the boot time effect should be felt on smaller 'a couple of
> >>>>> gigabytes' desktop boxes as well. Do we know exactly where the 2
> >>>>> hours of boot time on a 32 TB system is spent?
> >>>>
> >>>> There are other several spots that could be improved on a large
> >>>> system but memory initialization is by far the biggest.
> >>>
> >>> My feeling is that deferred/on-demand initialization triggered from
> >>> the buddy allocator is the better long term solution.
> >>
> >> I haven't caught up with all of Nathan's changes yet (just
> >> got back from vacation), but there was an option to either
> >> start the memory insertion on boot, or trigger it later
> >> using the /sys/.../memory interface. There is also a monitor
> >> program that calculates the memory insertion rate. This was
> >> extremely useful to determine how changes in the kernel
> >> affected the rate.
> >>
> >
> > Sorry, I *totally* did not follow that comment. It seemed like a
> > complete non-sequitur?
> >
> > -hpa
>
> It was I who was not following the question. I'm still reverting
> back to "work mode".
>
> [There is more code in a separate patch that Nate has not sent
> yet that instructs the kernel to start adding memory as early
> as possible, or not. That way you can start the insertion process
> later and monitor it's progress to determine how changes in the
> kernel affect that process. It is controlled by a separate
> CONFIG option.]
So, just to repeat (and expand upon) the solution hpa and I suggest:
it's not based on /sys, delayed initialization lists or any similar
(essentially memory hot plug based) approach.

It's a transparent on-demand initialization scheme, based on doing only
the very early memory setup in 1GB (or 2MB) steps - not in 4K steps
like we do it today. Any subsequent split-up initialization is done
on-demand, in alloc_pages() et al, initializing a batch of 512 (or
1024) struct page heads when an uninitialized portion is first
encountered.

This leaves the principal logic of early init largely untouched: we
still have the same amount of RAM during and after bootup - except that
on 32 TB systems we don't spend ~2 hours initializing 8,589,934,592
page heads.

This scheme could be implemented by introducing a new PG_initialized
flag, which is seen by an unlikely() branch in alloc_pages() and which
triggers the on-demand initialization of pages.

[ It could probably be made zero-cost for the post-initialization
  state: we already check a bunch of rare PG_ flags, so one more flag
  would not introduce any new branch in the page allocation hot path. ]

It's a technically different solution from what was submitted in this
thread.

Cons:

 - It works after bootup, via GFP. Done in a simple fashion it adds
   one more branch to the GFP fastpath. [ Done a bit more cleverly it
   can merge into an existing unlikely() branch and become essentially
   zero-cost for the fastpath. ]

 - It adds initialization non-determinism to GFP, to the tune of
   initializing ~512 page heads when RAM is utilized for the first
   time.

 - Initialization is done when memory is needed - not during or
   shortly after bootup. This (slightly) increases first-use overhead.
   [ I don't think this factor is significant - and I think we'll
   quickly see speedups to initialization, once the overhead becomes
   more easily measurable. ]

Pros:

 - It's transparent to the boot process.
   ('free' shows the same full amount of RAM all the time; there are
   no weird effects of RAM coming online asynchronously. You see all
   the RAM you have - etc.)

 - It helps the boot time of every single Linux system, not just large
   RAM ones. Even on a smallish 4GB system, memory init can take up
   precious hundreds of milliseconds, so this is a practical issue.

 - It spreads initialization overhead to later portions of the
   system's lifetime, when there's typically more idle time and more
   parallelism available.

 - Because initialization overhead becomes a natural part of
   first-time memory allocation with this scheme, it is more
   measurable (and thus more prominently optimized) than any deferred
   lists processed in the background.

 - As an added bonus it probably speeds up your usecase even more than
   the patches you are providing: on a 32 TB system the primary
   initialization would only have to enumerate memory, allocate page
   heads and buddy bitmaps, and initialize the 1GB granular page
   heads: there are only 32768 of them.

So unless I overlooked some factor, this scheme would be unconditional
goodness for everyone.

Thanks,

	Ingo