On Fri, Nov 17, 2017 at 01:19:56PM -0500, Pavel Tatashin wrote:
> On Thu, Nov 16, 2017 at 5:06 AM, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> > 4. Put a check into the page allocator slowpath that triggers serialised
> >    init if the system is booting and an allocation is about to fail. It
> >    would be such a cold path that it would never be noticeable, although
> >    it would leave dead code in the kernel image once boot had completed.
>
> Hi Mel,

Hi Pavel,

> The fourth approach is the best as it is seamless for admins and
> engineers, and it will also work on any system configuration with any
> parameters without any special involvement.

A lack of involvement from admins is indeed desirable. For example, while
I might concede on using a disable-everything switch, I would not be happy
to introduce a switch that specified how much memory per node to
initialise. For the fourth approach, I really would only be thinking of a
blunt "initialise everything instead of going OOM". I was wary of making
things too complicated, and I worried about some side-effects I'll cover
later.

> This approach will also address the following problem:
> reset_deferred_meminit() makes some assumptions about how much memory
> we will need beforehand, and those may break periodically as kernel
> requirements change. For instance, I recently reduced the amount of
> memory that system hash tables take on large machines [1], so the
> comment in that function is already outdated:
>
>  /*
>   * Initialise at least 2G of a node but also take into account that
>   * two large system hashes that can take up 1GB for 0.25TB/node.
>   */

True, that could be updated, although I would not necessarily alter the
value to minimise the memory requirements either. I would simply make the
comment a bit more general. More on this in a bit.

> With this approach we could always init a very small amount of struct
> pages, and allow the rest to be initialized on demand as boot requires,
> until deferred struct pages are initialized.
> Since the deferred pages feature assumes that the machine is large,
> there is no drawback to having some extra bytes of dead code,
> especially as all the checks can be permanently switched off via
> static branches once deferred init is complete.

This is where I fear there may be dragons. If we minimise the number of
struct pages and initialise serially as necessary, there is a danger that
we'll allocate remote memory in cases where local memory would have done,
because a remote node had enough memory. To offset that risk, it would be
necessary at boot time to force allocations from the local node where
possible and initialise more memory as necessary. That starts getting
complicated, because we'd need to adjust gfp flags in the fast path with
init-and-retry logic in the slow path, and that could be a constant
penalty. We could offset that in the fast path by using static branches,
but it's getting more and more complex for what is a minor optimisation --
shorter boot times on large machines where userspace itself could take a
*long* time to get up and running (think of a database reading 1TB of
data from disk as it warms up).

> The second benefit that this approach may bring is the following: it
> may enable us to add a new feature which would initialize struct pages
> on demand later, when needed by applications. This feature would be
> configurable or enabled via a kernel parameter (not sure which is
> better).
>
>  if (allocation is failing)
>      if (uninit struct pages available)
>          init enough to finish alloc

There is a hidden hazard with this as well -- benchmarking. Early in the
lifetime of the system, it's going to be slower because we're initialising
memory while measuring performance, when previously no such work would be
necessary. While it's somewhat of a corner case, it's still real and it
would generate reports.
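As an aside, the pseudocode quoted above could be sketched as a
self-contained userspace toy along the following lines. All names and
sizes here are illustrative, not the real allocator: a flag stands in
for a static branch, and "initialising" a page is just marking it in an
array rather than setting up a real struct page.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOTAL_PAGES 1024          /* pretend physical pages on the node */
#define INIT_CHUNK    64          /* pages initialised per on-demand step */

static bool page_initialised[TOTAL_PAGES];
static bool page_allocated[TOTAL_PAGES];
static size_t nr_initialised;     /* boundary of initialised struct pages */
static bool deferred_init_done;   /* stand-in for a static branch */

/* Serialised on-demand init: extend the initialised region by one chunk,
 * or flip the "done" flag when nothing uninitialised remains. */
static bool grow_initialised_region(void)
{
    if (nr_initialised >= TOTAL_PAGES) {
        deferred_init_done = true;  /* nothing left; check can be disabled */
        return false;
    }
    size_t end = nr_initialised + INIT_CHUNK;
    if (end > TOTAL_PAGES)
        end = TOTAL_PAGES;
    for (size_t i = nr_initialised; i < end; i++)
        page_initialised[i] = true;
    nr_initialised = end;
    return true;
}

/* Allocation slowpath: if the allocation is failing and uninitialised
 * struct pages remain, init enough to finish the alloc, then retry. */
static long alloc_page(void)
{
    for (;;) {
        for (size_t i = 0; i < nr_initialised; i++) {
            if (!page_allocated[i]) {
                page_allocated[i] = true;
                return (long)i;
            }
        }
        /* allocation is failing */
        if (deferred_init_done || !grow_initialised_region())
            return -1;  /* genuine OOM */
    }
}
```

In the real kernel, the `deferred_init_done` check would be a static
branch so the retry logic costs nothing once boot has completed, which is
exactly the dead-code-after-boot trade-off being discussed above.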
For example, I got burned once by a "regression" that was due to ext4's
lazy_init feature, because IO benchmarks appeared slower when in reality
it was only a fresh filesystem initialising. It was necessary to turn off
the feature at mkfs time to get accurate measurements. I think the same
could happen with memory, and we'd have to special-case some things. We'd
want to be *very* sure there was a substantial benefit to justify the
complexity. For benchmarking a system, we'd also need to be able to turn
it off.

> Again, once all pages are initialized, the checks will be turned off
> via static branching, so I think the code can be shared.
>
> Here is the rationale for this feature:
>
> Each physical machine may run a very large number of Linux containers.
> Steve Sistare (CCed) recently studied how much memory each instance of
> a Clear Container is taking, and it turns out to be about 125 MB when
> containers are booted with 2G of memory and 1 CPU. Out of those
> 125 MB, 32 MB is consumed by the struct page array, as we use 64 bytes
> per page. Admins tend to be protective of the amount of memory that is
> configured, therefore they may over-commit the amount of memory that
> is actually required by the container. So, by allowing struct pages to
> be initialized only on demand, we can save around 25% of the memory
> that is consumed by a fresh instance of a container. Now that struct
> pages are not zeroed during boot [2], if we implement the fourth
> option, we can get closer to a complete on-demand struct page
> initialization.
>
> I can volunteer to work on these projects.

I accept the potential for packing more containers into the system, and
I commend the work you've done so far, but I'd be wary and would warn you
of going too far down this path. I wouldn't NAK patches going in this
direction as long as they would eventually be behind static branches, but
I don't feel it's the most urgent problem to work on either.
This is why, even if I took the fourth option, it would be a blunt "init
everything if we're going OOM" approach. However, that is my opinion, and
it's partially based on a lack of sensible use cases. I suspect you have
better justification that would be included in changelogs.

-- 
Mel Gorman
SUSE Labs