On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
> The 16MB is the size of a hugepage, the size of interest as far as I am
> concerned. Your idea makes sense for large block support, but much less
> for huge pages because you are incurring a cost in the general case for
> something that may not be used.

Sorry for the misunderstanding, I totally agree!

> There is nothing to say that both can't be done. Raise the size of
> order-0 for large block support and continue trying to group the block
> to make hugepage allocations likely succeed during the lifetime of the
> system.

Sure, I completely agree.

> At the risk of repeating, your approach will be adding a new and
> significant dimension to the internal fragmentation problem where a
> kernel allocation may fail because the larger order-0 pages are all
> being pinned by userspace pages.

This is exactly correct: some memory will be wasted. The system will
reach zero free memory more quickly, depending on which kind of
applications are being run.

> It improves the probabilty of hugepage allocations working because the
> blocks with slab pages can be targetted and cleared if necessary.

Agreed.

> That suggestion was aimed at the large block support more than
> hugepages. It helps large blocks because we'll be allocating and freeing
> as more or less the same size. It certainly is easier to set
> slub_min_order to the same size as what is needed for large blocks in
> the system than introducing the necessary mechanics to allocate
> pagetable pages and userspace pages from slab.

Allocating userspace pages from slab in 4k chunks with a 64k PAGE_SIZE
is really complex indeed. I'm not planning for that in the short term,
but it remains a possibility to make the kernel more generic. Perhaps it
could be worth it.

Allocating ptes from slab is fairly simple, but I think it would be
better to allocate ptes in PAGE_SIZE (64k) chunks and to preallocate the
nearby ptes in the per-task local pagetable tree, to reduce the number
of locks taken and to avoid entering slab at all for that. In fact we
could allocate all four levels (or anyway more than one level) with a
single alloc_pages(0) call and track the leftovers in the mm (or
similar).
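
To make that last idea a bit more concrete, here is a rough sketch of
what carving several pagetable levels out of one order-0 page and
tracking the leftovers could look like. This is not existing kernel
code: pt_cache, PT_CHUNK_SIZE and pt_cache_alloc() are made-up names,
and freeing/refcounting of the backing page is left out. In a real
implementation the cache would hang off mm_struct and be protected by
an mm-level lock, so nearby pte allocations never have to enter the
page allocator (or slab) at all.

#include <linux/mm.h>
#include <linux/gfp.h>

#define PT_CHUNK_SIZE	4096	/* illustrative; real level sizes depend on the arch */

struct pt_cache {
	struct page	*page;	/* current backing order-0 page (64k) */
	unsigned int	next;	/* offset of the first unused chunk */
};

static void *pt_cache_alloc(struct pt_cache *cache)
{
	void *chunk;

	if (!cache->page || cache->next + PT_CHUNK_SIZE > PAGE_SIZE) {
		/* one alloc_pages(0) provides room for several levels */
		cache->page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
		if (!cache->page)
			return NULL;
		cache->next = 0;
	}

	chunk = page_address(cache->page) + cache->next;
	cache->next += PT_CHUNK_SIZE;
	return chunk;
}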

> I'm not sure what you are getting at here. I think it would make more
> sense if you said "when you read /proc/buddyinfo, you know the order-0
> pages are really free for use with large blocks" with your approach.

I'm unsure who reads /proc/buddyinfo (it can change a lot, and it's not
very significant information if the VM can defrag well inside the
reclaim code), but it's not much different; it's more about knowing the
real meaning of /proc/meminfo: freeable (unmapped) cache, anon ram and
free memory. The idea is that for an mmap over a large xfs file to
succeed with mlockall invoked, those large pages must become available,
or it'll be oom even though there are still 512M free... I'm quite sure
admins will get confused if the oom killer is invoked with lots of ram
still free. The overcommit feature will also break, just to make an
example (so much for overcommit mode 2 guaranteeing -ENOMEM retvals
instead of oom killage ;).

> All this aside, there is nothing mutually exclusive with what you are proposing
> and what grouping pages by mobility does. Your stuff can exist even if grouping
> pages by mobility is in place. In fact, it'll help us by giving an important
> comparison point as grouping pages by mobility can be trivially disabled with
> a one-liner for the purposes of testing. If your approach is brought to being
> a general solution that also helps hugepage allocation, then we can revisit
> grouping pages by mobility by comparing kernels with it enabled and without.

Yes, I totally agree. It sounds worthwhile to have good defrag logic in
the VM. Even allocating the kernel stack in today's kernels should be
able to benefit from your work. What worries me a bit is comparing a
fork() failure (something that will happen with ulimit -n too and that
apps must be able to deal with) with an I/O failure. I'm quite sure a
db failing I/O will not recover too nicely; if fork fails that's most
certainly ok... at worst a new client won't be able to connect and it
can retry later. Plus order 1 isn't really a big deal: the probability
of success decreases exponentially with the order.
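
As a back-of-the-envelope illustration of that last point (a toy model
of my own, not measured data): if each base page were free
independently with probability p, a naturally aligned order-n buddy
block would be entirely free with probability p^(2^n), which is where
the exponential drop with the order comes from.

#include <math.h>
#include <stdio.h>

int main(void)
{
	double p = 0.5;	/* say half of the base pages are free */
	int order;

	for (order = 0; order <= 4; order++)
		printf("order %d: chance of a fully free block %.6f\n",
		       order, pow(p, 1 << order));

	return 0;
}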