Hello, this fixes a potential issue with simultaneous 4k and 2M TLB entries in split_huge_page (at practically zero cost, so I didn't need to add a fake feature flag, and it's a lot safer to do it this way just in case). split_large_page in change_page_attr has the same issue too, but I've no idea how to fix it there, because that pmd cannot be marked non-present at any given time: change_page_attr may be running on RAM below 640k, and that is the same pmd where the kernel .text resides. However, I doubt it'll ever be a practical problem. Other CPUs also carry plenty of warnings about the risks of allowing simultaneous TLB entries of different sizes. (A rough sketch of the ordering the fix relies on is appended at the end of this mail.)

Johannes also sent a cute optimization to split_huge_page_vma/mm: he converted those into a single split_huge_page_pmd, and in addition he sent native support for hugepages in both mincore and mprotect, which shows how deeply he already understands the whole of huge_memory.c and its usage in the callers. Seeing significant contributions like this further confirms, I think, that this is the way to go. Thanks a lot Johannes. The ability to bisect before the mincore and mprotect native implementations is one of the huge benefits of this approach.

The hardest of all will be adding native swap support for 2M pages later (it involves making the swapcache 2M capable, and that in turn explodes all over the pagecache code more than the rest), but I think we have other priorities first:

1) merge memory compaction.

2) write an HPAGE_PMD_ORDER front slab allocator. I don't think memory compaction is capable of relocating in-use slab entries (correct me if I'm wrong; I think it's impossible as long as the slab entries are mapped by 2M pages and not by 4k ptes like vmalloc). So the idea is that the slab should allocate 2M; if that fails, 1M; if that fails, 512k, and so on until it falls back to 4k (rough sketch at the end of this mail). Otherwise the slab will fragment memory badly by allocating with alloc_page(): the buddy allocator basically guarantees the slab will generate as much fragmentation as possible, because it does its best to keep the high-order pages for whoever asks for them. Probably the fallback should happen inside the buddy allocator instead of calling alloc_pages repeatedly; that should avoid taking a flood of locks. Basically the buddy should concentrate the worst of the fragmentation on the users that can be relocated, while the users that cannot be relocated and only use 4k pages would be better served by a front allocator on top of alloc_pages, something like alloc_page_not_relocatable(), that does its work internally and tries to keep those allocations within the same 2M pages. This alone should help tremendously, and I think it's orthogonal to the memory compaction of the relocatable stuff. Or maybe we should just live with a large chunk of memory not being relocatable, but I like this idea because it's more dynamic and it won't need a fixed rule like "limit the slab to the 0-1G range"; it would also tend to keep fragmentation down even if we spill over the 1G range (1G is a purely made-up number).

3) teach KSM to merge hugepages. I talked about this with Izik and we agree the current KSM tree algorithm will be the best at that compared to the earlier KSM algorithms.

To run KVM on top of this and take advantage of hugepages you need a few-liner patch I posted to qemu-devel that takes care of aligning the start of guest memory, so that the guest physical address and the host virtual address have the same offset within the 2M page (sketch at the end of this mail).
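To illustrate the first point, here is a minimal sketch of the ordering the split_huge_page fix depends on. This is not the code in the patchset: the function name is made up, the pte-filling step is elided, and the exact helpers differ by kernel version; it only shows that the huge pmd goes away (and its 2M TLB entry is flushed) before the 4k page table becomes visible.

#include <linux/mm.h>
#include <linux/huge_mm.h>	/* HPAGE_PMD_SIZE */
#include <asm/pgalloc.h>	/* pmd_populate */
#include <asm/tlbflush.h>	/* flush_tlb_range */

/* Illustrative only: split one huge pmd into a 4k page table without
 * ever allowing a 2M and a 4k TLB entry for the same virtual range to
 * coexist on any CPU. */
static void split_pmd_sketch(struct vm_area_struct *vma, unsigned long haddr,
			     pmd_t *pmdp, pgtable_t pgtable)
{
	/* 1) make the pmd not present and flush the stale 2M TLB entry
	 *    (the old pmd value would be saved before the clear) */
	pmd_clear(pmdp);
	flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);

	/* ... fill 'pgtable' with the 512 4k ptes mirroring the old
	 *     huge mapping ... */

	/* 2) only now expose the 4k mappings through the pmd */
	pmd_populate(vma->vm_mm, pmdp, pgtable);
}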
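For point 2), just to make the fallback idea concrete, something along these lines (the function name is made up, and as said above the real thing should probably live inside the buddy allocator instead of calling alloc_pages repeatedly and taking a flood of locks):

#include <linux/gfp.h>
#include <linux/huge_mm.h>	/* HPAGE_PMD_ORDER */

/* Made-up front-allocator helper: try the largest order first and step
 * down, so the slab consumes whole 2M (or at least large) blocks
 * instead of scattering 4k pages all over the buddy. */
static struct page *slab_front_alloc(gfp_t gfp, unsigned int *order_ret)
{
	unsigned int order;

	for (order = HPAGE_PMD_ORDER; order; order--) {
		struct page *page;

		/* speculative high orders: don't retry, don't warn */
		page = alloc_pages(gfp | __GFP_NOWARN | __GFP_NORETRY, order);
		if (page) {
			*order_ret = order;
			return page;
		}
	}

	/* final 4k fallback with the caller's original gfp flags */
	*order_ret = 0;
	return alloc_pages(gfp, 0);
}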
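And for the KVM alignment, the idea behind the few-liner is simply this (a userland sketch, not the actual qemu patch: over-allocate guest RAM by 2M and round the start up, so guest physical offset 0 lands on a 2M-aligned host virtual address and the offsets within each 2M page match; the unused slack is just left mapped here):

#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)	/* 2M */

/* Hypothetical sketch: return a 2M-aligned start for guest RAM. */
static void *alloc_guest_ram(size_t size)
{
	uint8_t *p = mmap(NULL, size + HPAGE_SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* round up to the next 2M boundary */
	return (void *)(((uintptr_t)p + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
}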
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15.gz

It'd be nice to have this merged in -mm.

Thanks,
Andrea