Hello,

On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
> > The basic idea as outlined by Mel Gorman in [2] is:
> >
> > 1) On first fault in a sufficiently sized range, allocate a huge
> >    page sized and aligned block of base pages. Map the base page
> >    corresponding to the fault address and hold the rest of the
> >    pages in reserve.
> > 2) On subsequent faults in the range, map the pages from the
> >    reservation.
> > 3) When enough pages have been mapped, promote the mapped pages
> >    and the remaining pages in the reservation to a huge page.
> > 4) When there is memory pressure, release the unused pages from
> >    their reservations.
>
> I haven't yet read the patch in detail, but I'm skeptical about the
> approach in general for a few reasons:
>
> - Retracting the PTE page table to replace it with a huge PMD entry
>   requires down_write(mmap_sem). That makes the approach not
>   practical for many multi-threaded workloads.
>
>   I don't see a way to avoid the exclusive lock here. I will be
>   glad to be proved otherwise.
>
> - The promotion will also require a TLB flush, which might be
>   prohibitively slow on big machines.
>
> - Short-lived processes will fail to benefit from THP with this
>   policy, even with plenty of free memory in the system: either
>   there is no time to promote to THP or, with synchronous
>   promotion, the cost will outweigh the benefit.
>
> The goal of reducing the memory overhead of THP is admirable, but
> we need to be careful not to kill the THP benefit itself. The
> approach will reduce the number of THP mapped in the system and/or
> shift their allocation to a later stage of process lifetime.
>
> The only way I can see it being useful is if it is possible to
> apply the policy on a per-VMA basis. That would be very useful for
> malloc() implementations, for instance. But as a global policy it's
> a no-go to me.

I'm also skeptical about this: the current design is quite
intentional. It's not a bug but a feature that we're not doing the
promotion. Part of the tradeoff with THP is to use more RAM to save
CPU; when you use less RAM, you're inherently already wasting some
CPU just on the reservation management, and you don't get the
immediate TLB benefit anymore either.

And if you're in the camp that is concerned about the use of more
RAM and/or about the higher latency of COW faults, I'm afraid this
intermediate solution will still be slower than the already
available MADV_NOHUGEPAGE or enabled=madvise. Apps like redis that
use more RAM during a snapshot and that are slowed down by THP
simply need to use MADV_NOHUGEPAGE, which has existed as an madvise
hint since the very first kernel that supported THP-anon. The same
goes for other apps that use more RAM with THP and are on the losing
end of the tradeoff.

Now about the implementation: the whole point of the reservation
complexity is to skip the khugepaged copy, so it can collapse in
place. Is skipping the copy worth it? Isn't the big cost the IPI
anyway, needed to avoid leaving two simultaneous TLB mappings of
different granularity?

khugepaged is already tunable to specify the ratio of memory that
must be in use before collapsing, precisely to avoid wasting memory:
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none. If you
set max_ptes_none to half the default value, it'll only promote
ranges that are at least half mapped, reducing the memory waste to
50% of what it is by default.
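To be concrete about the two existing knobs (the numbers below
assume x86-64 with 4k base pages, i.e. 512 PTEs per PMD and a
max_ptes_none default of 511): writing e.g. 255 to max_ptes_none
makes khugepaged collapse only ranges with at most 255 still
unfaulted subpages, i.e. ranges at least half populated. And the
per-range opt-out is plain madvise(2); a minimal userland sketch,
with a made-up mapping size:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>

        #define LEN (64UL << 20)   /* 64M of anon memory, arbitrary */

        int main(void)
        {
                void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                /*
                 * Opt this range out of THP: no hugepage faults and
                 * no khugepaged collapse in [p, p + LEN), regardless
                 * of the global "enabled" setting.
                 */
                if (madvise(p, LEN, MADV_NOHUGEPAGE)) {
                        perror("madvise");
                        return 1;
                }
                return 0;
        }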
So if you are OK with copying the memory that you promote to THP,
you'd just need a global THP mode that avoids allocating THP during
the page fault even when they're available (while still allowing
khugepaged to collapse hugepages in the background), and then to
reduce max_ptes_none to get the desired promotion ratio.

Doing the copy avoids the reservation entirely; there will also be
more THP available for the khugepaged users, instead of losing them
in reservations. You won't have to worry about what to do under
memory pressure either: there is no reservation to undo, because
there was no reservation in the first place. That problem also goes
away with the copy.

So it sounds like you could achieve similar runtime behavior with
much less complexity by reducing max_ptes_none, doing the copy, and
dropping all the reservation code.

> Prove me wrong with performance data. :)

Same here.

Thanks,
Andrea