On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
> The basic idea as outlined by Mel Gorman in [2] is:
>
> 1) On first fault in a sufficiently sized range, allocate a huge page
>    sized and aligned block of base pages.  Map the base page
>    corresponding to the fault address and hold the rest of the pages in
>    reserve.
> 2) On subsequent faults in the range, map the pages from the reservation.
> 3) When enough pages have been mapped, promote the mapped pages and
>    remaining pages in the reservation to a huge page.
> 4) When there is memory pressure, release the unused pages from their
>    reservations.

I haven't yet read the patch in detail, but I'm skeptical about the
approach in general for a few reasons:

 - Retracting a PTE page table to replace it with a huge PMD entry
   requires down_write(mmap_sem). That makes the approach impractical
   for many multi-threaded workloads. I don't see a way to avoid an
   exclusive lock here. I will be glad to be proven otherwise.

 - The promotion will also require a TLB flush, which might be
   prohibitively slow on big machines.

 - Short-lived processes will fail to benefit from THP under this
   policy, even with plenty of free memory in the system: there is no
   time to promote to THP, or, with synchronous promotion, the cost
   will outweigh the benefit.

The goal of reducing the memory overhead of THP is admirable, but we
need to be careful not to kill the benefit of THP itself. The approach
will reduce the number of THPs mapped in the system and/or shift their
allocation to a later stage of the process lifetime.

The only way I see it can be useful is if it is possible to apply the
policy on a per-VMA basis. That would be very useful for malloc()
implementations, for instance. But as a global policy it's a no-go to
me.

Prove me wrong with performance data. :)

-- 
 Kirill A. Shutemov
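For readers outside mm/, the reserve-then-promote policy quoted above can be sketched as a toy user-space model. This is not kernel code: the class, the page lists, and the promotion threshold are all illustrative assumptions; the real patch manipulates struct pages and page tables, and the promotion step is where the down_write(mmap_sem) and TLB-flush costs discussed in this thread would apply.

```python
# Toy model of the reserve/promote policy (steps 1-4 above).
# HUGE_PAGE_NR matches x86-64 (2 MiB / 4 KiB); PROMOTE_THRESHOLD is a
# made-up "enough pages mapped" cutoff, not a value from the patch.

HUGE_PAGE_NR = 512
PROMOTE_THRESHOLD = 448

class Reservation:
    def __init__(self):
        # Step 1: on first fault, allocate a huge-page-sized and
        # -aligned block of base pages and hold them in reserve.
        self.pages = list(range(HUGE_PAGE_NR))
        self.mapped = set()
        self.promoted = False

    def fault(self, index):
        # Steps 1-2: map the faulting base page from the reservation.
        if self.promoted:
            return
        self.mapped.add(index)
        # Step 3: once enough pages are mapped, promote the range.
        if len(self.mapped) >= PROMOTE_THRESHOLD:
            self.promote()

    def promote(self):
        # In the kernel this is the expensive part: retracting the PTE
        # table under down_write(mmap_sem) and flushing the TLB before
        # installing the huge PMD.
        self.promoted = True
        self.mapped = set(self.pages)

    def shrink(self):
        # Step 4: under memory pressure, release unused reserved pages.
        if self.promoted:
            return 0
        unused = [p for p in self.pages if p not in self.mapped]
        self.pages = [p for p in self.pages if p in self.mapped]
        return len(unused)

r = Reservation()
for i in range(PROMOTE_THRESHOLD - 1):   # one fault short of promotion
    r.fault(i)
freed = r.shrink()                       # pressure strikes first
print(freed)                             # 512 - 447 = 65 pages released
```

The model also makes the short-lived-process objection concrete: a process that exits (or hits memory pressure) before reaching the threshold never promotes, so it pays the reservation bookkeeping without ever getting a huge mapping.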