On Sun, Apr 11, 2010 at 12:00:29AM +0300, Avi Kivity wrote:
> IMO, both. The kernel should align vmas on 2MB boundaries (good for
> small pages as well). glibc should use 2MB increments. Even on <2MB

Agreed.

> sized vmas, the kernel should reserve the large page frame for a while
> in the hope that the application will use it in a short while.

I don't see the need for this per-process, and the buddy logic is
already doing exactly that for us... (even without the
movable/unmovable fallback logic)

> There are also guard pages around stacks IIRC, we could make them 2MB on
> x86-64.

Agreed. That will provide little benefit though: stack usage is quite
local near the top and few apps store bulk data there (it's hard to
reach even 512k in size). Firefox has a 300k stack. It'll waste >1M per
process. If the stack grows and the application is long lived,
khugepaged takes care of this already. But personally I tend to like a
black/white approach as much as possible, so I agree to make the vma
large enough immediately if enabled = always.

> Well, but mapping a 2MB vma with a large page could be a considerable
> waste if the application doesn't eventually use it. I'd like to map the
> pages with small pages (belonging to a large frame) and if the
> application actually uses the pages, switch to a large pte.
>
> Something that can also improve small pages is to prefault the vma with
> small pages, but with the accessed and dirty bit cleared. Later, we
> check those bits and reclaim the pages if they're unused, or coalesce
> them if they were used. The nice thing is that we save tons of page
> faults in the common case where the pages are used.

Yeah we could do that. I'm not against it, but doing these things is
not my preference. Anything that introduces the risk of performance
regressions in corner cases frightens me. I prefer to pay with RAM
anytime.
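
As a rough illustration of the ">1M per process" figure for a 300k
stack, here is a small sketch of the rounding arithmetic. It assumes a
2MB huge page (x86-64) and that the whole stack vma ends up backed by
whole huge pages; the names HPAGE_SIZE, hpage_round_up and
hpage_padding are made up for this example and are not kernel symbols.

```c
#include <stddef.h>

/* Assumption: 2MB huge page size, as on x86-64. */
#define HPAGE_SIZE (2UL * 1024 * 1024)

/* Round len up to the next multiple of the huge page size. */
static size_t hpage_round_up(size_t len)
{
    return (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

/* Bytes left unused when len bytes are backed by whole huge pages. */
static size_t hpage_padding(size_t len)
{
    return hpage_round_up(len) - len;
}
```

With these helpers, hpage_padding(300 * 1024) is 2048k - 300k = 1748k,
which is where a waste of more than 1M for a 300k stack comes from.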
Again I like to keep the design as black/white as possible: if somebody
is RAM constrained he shouldn't leave enabled=always but keep
enabled=madvise. That's the whole point of having added
enabled = madvise, for those who are RAM constrained but want to run
faster anyway with zero RAM-waste risk. These days even desktop systems
have more RAM than needed so I don't see the big deal. We should
squeeze every possible CPU cycle out of the RAM (even in the user
stack, even if it's likely not significant and just a RAM waste) and
not waste CPU in pre-faulting or in migrating 4k pages to 2M pages when
vm_end grows, and then having to find which unmapped pages of a
hugepage to reclaim after splitting it on the fly. I want to reduce to
the minimum the risk of regressions anywhere when full transparency is
enabled. This also has the benefit of keeping the kernel code simpler
and with fewer special cases ;).

It may not be ideal if you have a 1G desktop system and you want to run
faster when encoding a movie, but for that there's exactly
madvise(MADV_HUGEPAGE). qemu-kvm/transcode/ffmpeg can all use a little
madvise on their big chunks of memory. khugepaged should also learn to
prioritize those VM_HUGEPAGE vmas before scanning the rest (which it
doesn't do right now to keep it a bit simpler, but obviously there's
room for improvement).

Anyway I think we can start with aligning the vmas that don't pad
themselves with the previous vma to 2M size, and have the stack also
aligned so the page faults will fill them automatically. Changing glibc
to grow by 2M instead of 1M is a one-line change to a #define and it'll
also halve the number of mmap syscalls, so it's quite a straightforward
next step. I also need to make it NUMA aware with an alloc_pages_vma.
Both are simple enough that I can do them right now without worries.
Then we can rethink making the kernel more complex. I don't mean it's a
bad idea, just less obvious than paying with RAM and being simpler...
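
For the madvise route, a minimal userspace sketch could look like the
following. It assumes a Linux system whose headers define MADV_HUGEPAGE
and a 2MB huge page size; alloc_hugepage_hint and HPAGE_SIZE are
hypothetical names for this example, not an existing API. It
over-allocates by one huge page so that a 2MB-aligned start always
fits, trims the excess, and then hints the region.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2MB huge page size, as on x86-64. */
#define HPAGE_SIZE (2UL * 1024 * 1024)

/*
 * Map an anonymous region of at least len bytes whose start is 2MB
 * aligned and hint that it should be backed by huge pages.
 * Returns NULL if mmap fails. The madvise return value is stored in
 * *advice_rc (if non-NULL); the hint is advisory, so a failure there
 * still leaves a usable mapping.
 */
static void *alloc_hugepage_hint(size_t len, int *advice_rc)
{
    size_t size = (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    size_t total = size + HPAGE_SIZE;   /* room to align the start */
    char *raw = mmap(NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t start = ((uintptr_t)raw + HPAGE_SIZE - 1)
                      & ~(uintptr_t)(HPAGE_SIZE - 1);
    char *aligned = (char *)start;

    /* Trim the unaligned head and the unused tail. */
    if (aligned > raw)
        munmap(raw, (size_t)(aligned - raw));
    size_t tail = (size_t)((raw + total) - (aligned + size));
    if (tail)
        munmap(aligned + size, tail);

    int rc = 0;
#ifdef MADV_HUGEPAGE
    rc = madvise(aligned, size, MADV_HUGEPAGE);
#endif
    if (advice_rc)
        *advice_rc = rc;
    return aligned;
}
```

A caller with a big buffer would use alloc_hugepage_hint(len, &rc) in
place of a plain anonymous mmap, and can ignore rc if it only wants a
best-effort hint.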
I want to be sure this is rock solid before we go ahead doing more
complex stuff. There have been zero problems so far: backing out the
anon-vma changes solved the only bug that triggered without memory
compaction (it showed a skew between the pmd_huge mappings and
page_mapcount because of anon-vma errors), and memory compaction also
works great now with the last integration fix ;).