On 11/09/2018 04:13 AM, Kirill A. Shutemov wrote:
> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
>> The basic idea as outlined by Mel Gorman in [2] is:
>>
>> 1) On first fault in a sufficiently sized range, allocate a huge page
>>    sized and aligned block of base pages.  Map the base page
>>    corresponding to the fault address and hold the rest of the pages in
>>    reserve.
>> 2) On subsequent faults in the range, map the pages from the reservation.
>> 3) When enough pages have been mapped, promote the mapped pages and
>>    remaining pages in the reservation to a huge page.
>> 4) When there is memory pressure, release the unused pages from their
>>    reservations.
> I haven't yet read the patch in details, but I'm skeptical about the
> approach in general for few reasons:
>
> - PTE page table retracting to replace it with huge PMD entry requires
>   down_write(mmap_sem). It makes the approach not practical for many
>   multi-threaded workloads.
>
>   I don't see a way to avoid exclusive lock here. I will be glad to
>   be proved otherwise.
>
> - The promotion will also require TLB flush which might be prohibitively
>   slow on big machines.
>
> - Short living processes will fail to benefit from THP with the policy,
>   even with plenty of free memory in the system: no time to promote to THP
>   or, with synchronous promotion, cost will overweight the benefit.
>
> The goal to reduce memory overhead of THP is admirable, but we need to be
> careful not to kill THP benefit itself. The approach will reduce number of
> THP mapped in the system and/or shift their allocation to later stage of
> process lifetime.
>
> The only way I see it can be useful is if it will be possible to apply the
> policy on per-VMA basis. It will be very useful for malloc()
> implementations, for instance. But as a global policy it's no-go to me.

I agree that this should not be a global policy.  For example, it seems to me
that a VMA where MADV_HUGEPAGE has been applied should get huge pages on first
faults (I need to fix that in my implementation).

>
> Prove me wrong with performance data. :)

I'll try. :-)

Thanks for the comments!

Anthony
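
A toy user-space model of the four quoted steps, for illustration only. It is not taken from the patch series: the class and method names, the figure of 512 base pages per huge page (2 MiB huge pages with 4 KiB base pages, as on x86-64) and the promotion threshold are all assumptions, and it ignores locking, TLB flushes and page table manipulation entirely.

```python
from dataclasses import dataclass, field

PAGES_PER_HUGE = 512     # assumption: 2 MiB huge page / 4 KiB base page
PROMOTE_THRESHOLD = 64   # hypothetical tunable, not a value from the patch


@dataclass
class Reservation:
    base: int                                 # first page index, huge-page aligned
    mapped: set = field(default_factory=set)  # base pages mapped so far
    promoted: bool = False                    # step 3 has happened
    released: bool = False                    # step 4 has happened

    def unused(self) -> int:
        """Pages still held in reserve (none once promoted or released)."""
        if self.promoted or self.released:
            return 0
        return PAGES_PER_HUGE - len(self.mapped)


class ReservationTable:
    def __init__(self, threshold: int = PROMOTE_THRESHOLD):
        self.threshold = threshold
        self.reservations = {}  # aligned base page index -> Reservation

    def fault(self, page: int) -> Reservation:
        base = page - page % PAGES_PER_HUGE
        res = self.reservations.get(base)
        if res is None:
            # Step 1: first fault in the range reserves an aligned block.
            res = Reservation(base)
            self.reservations[base] = res
        if res.promoted:
            # Whole range is already covered by the huge page.
            return res
        # Steps 1 and 2: map the faulting base page.
        res.mapped.add(page)
        # Step 3: promote once enough pages are mapped, but only while
        # the rest of the block is still held in reserve.
        if not res.released and len(res.mapped) >= self.threshold:
            res.promoted = True
        return res

    def shrink(self) -> int:
        """Step 4: under memory pressure, give back unused reserved pages."""
        freed = 0
        for res in self.reservations.values():
            freed += res.unused()
            if not res.promoted:
                res.released = True
        return freed
```

In this model a range whose reservation has been released keeps its already mapped base pages but can no longer be promoted, which is one possible reading of step 4 rather than something the quoted outline specifies.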