On Tue, 24 Nov 2020, Rik van Riel wrote:
> The allocation flags of anonymous transparent huge pages can be controlled
> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> keep the system from getting bogged down in the page reclaim and compaction
> code when many THPs are getting allocated simultaneously.
>
> However, the gfp_mask for shmem THP allocations was not limited by those
> configuration settings, and some workloads ended up with all CPUs stuck
> on the LRU lock in the page reclaim code, trying to allocate dozens of
> THPs simultaneously.
>
> This patch applies the same configured limitation of THPs to shmem
> hugepage allocations, to prevent that from happening.
>
> This way a THP defrag setting of "never" or "defer+madvise" will result
> in quick allocation failures without direct reclaim when no 2MB free
> pages are available.
>
> With this patch applied, THP allocations for tmpfs will be a little
> more aggressive than today for files mmapped with MADV_HUGEPAGE,
> and a little less aggressive for files that are not mmapped or
> mapped without that flag.
>
> v6: make khugepaged actually obey tmpfs mount flags
> v5: reduce gfp mask further if needed, to accommodate i915 (Matthew Wilcox)
> v4: rename alloc_hugepage_direct_gfpmask to vma_thp_gfp_mask (Matthew Wilcox)
> v3: fix NULL vma issue spotted by Hugh Dickins & tested
> v2: move gfp calculation to shmem_getpage_gfp as suggested by Yu Xu

Andrew, please don't rush

 mmthpshmem-limit-shmem-thp-alloc-gfp_mask.patch
 mmthpshm-limit-gfp-mask-to-no-more-than-specified.patch
 mmthpshmem-make-khugepaged-obey-tmpfs-mount-flags.patch

to Linus in your first wave of mmotm->5.11 sendings. Or, alternatively,
go ahead and send them to Linus, but be aware that I'm fairly likely to
want adjustments later.
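For reference, the defrag policy the changelog describes (the one
vma_thp_gfp_mask applies) can be modelled as below. This is only a
sketch: the flag values and the thp_gfp_mask() helper are made up for
illustration and are not the kernel's actual definitions; only the
shape of the policy follows the changelog (direct reclaim permitted
under "always", or under "madvise"/"defer+madvise" when the mapping has
MADV_HUGEPAGE; otherwise fail fast):

```c
#include <stdbool.h>

/* Hypothetical, simplified flag bits -- NOT the kernel's real gfp values. */
#define GFP_TRANSHUGE_LIGHT  0x01u  /* huge page, no direct reclaim */
#define __GFP_DIRECT_RECLAIM 0x02u
#define __GFP_KSWAPD_RECLAIM 0x04u
#define __GFP_NORETRY        0x08u
#define GFP_TRANSHUGE        (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

enum thp_defrag {
	DEFRAG_ALWAYS,
	DEFRAG_DEFER,
	DEFRAG_DEFER_MADVISE,
	DEFRAG_MADVISE,
	DEFRAG_NEVER,
};

/*
 * Illustrative model of the per-setting gfp choice: vma_madvised stands
 * for "the mapping was marked MADV_HUGEPAGE".
 */
static unsigned thp_gfp_mask(enum thp_defrag mode, bool vma_madvised)
{
	switch (mode) {
	case DEFRAG_ALWAYS:
		/* always compact synchronously; retry less if not madvised */
		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
	case DEFRAG_DEFER:
		/* never block; just kick kswapd/kcompactd */
		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
	case DEFRAG_DEFER_MADVISE:
		/* direct reclaim only for MADV_HUGEPAGE mappings */
		return GFP_TRANSHUGE_LIGHT |
		       (vma_madvised ? __GFP_DIRECT_RECLAIM
				     : __GFP_KSWAPD_RECLAIM);
	case DEFRAG_MADVISE:
		return GFP_TRANSHUGE_LIGHT |
		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
	case DEFRAG_NEVER:
	default:
		/* quick failure when no 2MB free pages are available */
		return GFP_TRANSHUGE_LIGHT;
	}
}
```

The point of the patch series is that shmem THP allocations should pass
through a mask shaped like this, as anonymous THP allocations already do.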
Sorry for limping along so far behind, but I still have more re-reading
of the threads to do, and I'm still investigating why tmpfs huge=always
becomes so ineffective in my testing with these changes, even if I ramp
up from the default defrag=madvise to defrag=always:

                        5.10     mmotm
thp_file_alloc       4641788    216027
thp_file_fallback     275339   8895647

I've been looking into it off and on for weeks (gfp_mask wrangling is
not my favourite task! so I tend to find higher priorities to divert me);
I hoped to arrive at a conclusion before the merge window, but still have
nothing constructive to say yet, hence my silence so far. The "a little
less aggressive" above appears to be an understatement at present.

I respect what Rik is trying to achieve here, and I may end up concluding
that there's nothing better to be done than what he has. My kind of
hugepage-thrashing-in-low-memory may be so remote from normal usage, and
could be skirting the latency horrors we all want to avoid: but I haven't
reached that conclusion yet - the disparity in effectiveness still
deserves more investigation.

(There's also a specific issue with the gfp_mask limiting: I have not yet
reviewed the allowing and denying in detail, but it looks like it does
not respect the caller's GFP_ZONEMASK - the gfp in shmem_getpage_gfp()
and shmem_read_mapping_page_gfp() is there to satisfy the gma500, which
wanted to use shmem but could only manage DMA32. I doubt it wants THPs,
but shmem_enabled=force forces them.)

Thanks,
Hugh
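To make the zone-mask concern above concrete, here is a standalone
sketch. The bit layout and both helper names (limit_gfp_naive,
limit_gfp_preserve_zone) are invented for illustration, not taken from
the kernel: it shows how limiting a caller's gfp with a plain bitwise
AND against the configured mask silently drops a zone bit such as
__GFP_DMA32 (the gma500 case), while carrying the GFP_ZONEMASK bits
through unchanged preserves the caller's zone restriction:

```c
/* Hypothetical, simplified bit layout -- NOT the kernel's real gfp bits. */
#define __GFP_DMA32          0x01u
#define GFP_ZONEMASK         0x0fu  /* low bits select the allocation zone */
#define __GFP_DIRECT_RECLAIM 0x10u
#define __GFP_KSWAPD_RECLAIM 0x20u

/*
 * Naive limiting: AND the caller's mask with the configured limit.
 * If the limit carries no zone bits, the caller's zone is lost.
 */
static unsigned limit_gfp_naive(unsigned gfp, unsigned limit)
{
	return gfp & limit;
}

/*
 * Zone-aware limiting: restrict only the non-zone (reclaim/behaviour)
 * bits, and pass the caller's zone selection through untouched.
 */
static unsigned limit_gfp_preserve_zone(unsigned gfp, unsigned limit)
{
	return (gfp & GFP_ZONEMASK) | (gfp & limit & ~GFP_ZONEMASK);
}
```

With a DMA32-restricted caller and a "defer"-style limit, the naive
version loses the zone bit while the zone-aware version keeps it and
still strips direct reclaim, which is the behaviour the parenthetical
above is asking for.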