From: Alexander Zhu <alexlzhu@xxxxxx>

Changelog:

v5 to v6
-removed the PageSwapCache check from add_underutilized_thp, as
 split_huge_page takes care of this already.
-added a check for PageHuge in add_underutilized_thp to account for
 hugetlbfs pages.
-added Yu Zhao as author of the second patch

v4 to v5
-split the split_huge_page changes out into three different patches: one
 for zapping zero pages, one for not remapping zero pages, and one for
 self tests.
-fixed a bug with lru_to_folio that was corrupting the folio
-fixed a bug with memchr_inv in mm/thp_utilization; a zero page should
 mean !memchr_inv(kaddr, 0, PAGE_SIZE)

v3 to v4
-changed the thp_utilization_bucket() function to take folios, which
 saves the conversion between page and folio
-added newlines where they were previously missing in v2-v3
-moved the thp utilization code out into its own file under
 mm/thp_utilization.c
-removed the is_anonymous_transparent_hugepage function; use
 folio_test_anon and folio_test_trans_huge instead.
-changed thp_number_utilized_pages to use memchr_inv
-added some comments regarding trylock
-changed the relock to be unconditional in low_util_free_page
-only expose can_shrink_thp; the thp_utilization and bucket logic is kept
 private to mm/thp_utilization.c

v2 to v3
-put_page() after trylock_page in low_util_free_page; put() is to be
 called after the get() call
-removed the spin_unlock_irq in low_util_free_page above LRU_SKIP; there
 was a double unlock.
-moved spin_unlock_irq() to below list_lru_isolate() in
 low_util_free_page. This is to shorten the critical section.
-moved lock_page in add_underutilized_thp such that we only lock when
 allocating and adding to the list_lru
-removed list_lru_alloc in list_lru_add_page and list_lru_delete_page, as
 these are no longer needed.

v1 to v2
-reversed the ordering of is_transparent_hugepage and PageAnon in
 is_anon_transparent_hugepage; page->mapping is only meaningful for user
 pages
-only trigger the unmap_clean/zap in split_huge_page on anonymous THPs.
 We cannot zap zero pages for file THPs.
-modified the split_huge_page self test based off more recent changes.
-changed lru_lock to be irq safe; added irq_save and restore around
 list_lru adds/deletes.
-changed low_util_free_page() to trylock the page and, if that fails,
 unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between
 reclaim, which calls split_huge_page(), and the THP shrinker.
-changed low_util_free_page() to unlock lru_lock, call split_huge_page,
 then lock lru_lock again. This way split_huge_page is not called with
 the lru_lock held, which leads to deadlock as split_huge_page calls
 on_each_cpu_mask. See the sketch after this changelog.
-changed list_lru_shrink_walk to list_lru_shrink_walk_irq.

RFC to v1
-refactored out the code to obtain the thp_utilization_bucket, as that
 now has to be used in multiple places.
-added support to map to the read-only zero page when splitting a THP
 registered with userfaultfd.
-added a self test to verify that the userfaultfd change is working.
-remove all THPs that are not in the top utilization bucket. This is what
 we have found to perform best in production testing; there is an almost
 trivial number of THPs in the middle range of buckets, and most of the
 memory waste comes from the least utilized bucket.
-added a check of THP utilization prior to split_huge_page for the THP
 shrinker. This is to account for THPs that have moved to the top bucket
 but were underutilized at the time they were added to the list_lru.
-multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is
 because a THP is 512 pages and should count as 512 objects in reclaim.
 This way reclaim is triggered at a more appropriate frequency than in
 the RFC.
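As a reference for reviewers, below is a minimal sketch of the lru_lock/page
lock ordering described in the v1-to-v2 notes above. It is illustrative only,
not the code in the patches: the way the page is pinned and looked up from the
list_lru item (the underutilized_thp_list field here is hypothetical), the
signature of can_shrink_thp, and the error handling are all assumptions.

static enum lru_status low_util_free_page(struct list_head *item,
                                          struct list_lru_one *list,
                                          spinlock_t *lru_lock,
                                          void *cb_arg)
{
        /* Hypothetical list_head field linking the THP into the list_lru. */
        struct page *page = list_entry(item, struct page,
                                       underutilized_thp_list);

        /*
         * Avoid deadlocking against reclaim, which calls split_huge_page()
         * with the page locked: never block on the page lock here.
         */
        if (!trylock_page(page))
                return LRU_SKIP;

        list_lru_isolate(list, item);

        /*
         * Drop the lru_lock before split_huge_page(): it calls
         * on_each_cpu_mask() and must not run under the lru_lock.
         */
        spin_unlock_irq(lru_lock);

        /* Re-check utilization; the THP may have filled up since it was added. */
        if (can_shrink_thp(page_folio(page)))
                split_huge_page(page);

        unlock_page(page);

        /* Relock unconditionally before returning to the list_lru walker. */
        spin_lock_irq(lru_lock);
        return LRU_REMOVED_RETRY;
}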
Transparent Hugepages use a larger page size of 2MB in comparison to normal
sized pages of 4KB. A larger page size allows for fewer TLB cache misses and
thus more efficient use of the CPU. However, using a larger page size also
results in more memory waste, which can hurt performance in some use cases.
THPs are currently enabled in the Linux kernel by applications in limited
virtual address ranges via the madvise system call. The THP shrinker tries to
find a balance between increased use of THPs and increased use of memory. It
shrinks the size of memory by removing the underutilized THPs that are
identified by the thp_utilization scanner.

In our experiments we have noticed that the least utilized THPs are almost
entirely unutilized. Below is a sample, where each bucket line shows the
number of THPs in that utilization range followed by the number of free pages
they contain.

Sample Output:

Utilized[0-50]: 1331 680884
Utilized[51-101]: 9 3983
Utilized[102-152]: 3 1187
Utilized[153-203]: 0 0
Utilized[204-255]: 2 539
Utilized[256-306]: 5 1135
Utilized[307-357]: 1 192
Utilized[358-408]: 0 0
Utilized[409-459]: 1 57
Utilized[460-512]: 400 13
Last Scan Time: 223.98s
Last Scan Duration: 70.65s

Above is a sample obtained from one of our test machines when THP is always
enabled. Of the 1331 THPs in this thp_utilization sample that have from 0-50
utilized subpages, we see that there are 680884 free pages. This comes out to
680884 / (512 * 1331) = 99.91% zero pages in the least utilized bucket, and
represents 680884 * 4KB = 2.7GB of memory waste. Also note that the vast
majority of pages are in either the least utilized [0-50] or the most
utilized [460-512] bucket. The least utilized THPs are responsible for almost
all of the memory waste when THP is always enabled; thus, by clearing out the
THPs in the lowest utilization bucket we extract most of the improvement in
CPU efficiency. We have seen similar results on our production hosts.

This patchset introduces the THP shrinker we have developed to identify and
split the least utilized THPs. It includes the thp_utilization changes that
group anonymous THPs into utilization buckets (a simplified sketch of the
bucketing is included after this cover letter text), the split_huge_page()
changes that identify and zap zero-filled 4KB subpages within THPs, and the
shrinker changes themselves. It should be noted that the split_huge_page()
changes are based off previous work done by Yu Zhao.

In the future, we intend to allow additional tuning of the shrinker per
workload, depending on CPU/IO/memory pressure and the amount of anonymous
memory. The long term goal is to eventually always enable THP for all
applications and deprecate madvise entirely. In production we have thus far
observed a 2-3% reduction in overall CPU usage on stateless web servers when
THP is always enabled.
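To make the bucketing concrete, here is a simplified sketch of how a THP could
be classified into one of the ten utilization buckets shown in the sample
output. It is not the code in mm/thp_utilization.c: the helper names mirror
ones mentioned in the changelog (thp_number_utilized_pages,
thp_utilization_bucket), but the signatures, the exact bucket boundaries, and
the zero-detection details are assumptions.

#define THP_UTIL_BUCKET_NR      10

/* Count subpages that are not entirely zero-filled (caller holds a reference). */
static int thp_number_utilized_pages(struct folio *folio)
{
        int i, utilized = 0;

        for (i = 0; i < folio_nr_pages(folio); i++) {
                void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);

                /* A subpage counts as utilized unless it is all zeroes. */
                if (memchr_inv(kaddr, 0, PAGE_SIZE))
                        utilized++;
                kunmap_local(kaddr);
        }
        return utilized;
}

/*
 * Map 0..HPAGE_PMD_NR utilized subpages onto buckets 0..9, roughly matching
 * the Utilized[0-50] .. Utilized[460-512] ranges in the sample output.
 */
static int thp_utilization_bucket(int num_utilized_pages)
{
        return min(num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR,
                   THP_UTIL_BUCKET_NR - 1);
}

With a classification like this, the shrinker only queues THPs that fall
outside the top bucket, since those are the ones responsible for nearly all of
the memory waste in the sample above.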
Alexander Zhu (4):
  mm: add thp_utilization metrics to debugfs
  mm: do not remap clean subpages when splitting isolated thp
  mm: add selftests to split_huge_page() to verify unmap/zap of zero pages
  mm: THP low utilization shrinker

Yu Zhao (1):
  mm: changes to split_huge_page() to free zero filled tail pages

 Documentation/admin-guide/mm/transhuge.rst    |   9 +
 include/linux/huge_mm.h                       |   9 +
 include/linux/list_lru.h                      |  24 ++
 include/linux/mm_types.h                      |   5 +
 include/linux/rmap.h                          |   2 +-
 include/linux/vm_event_item.h                 |   3 +
 mm/Makefile                                   |   2 +-
 mm/huge_memory.c                              | 156 +++++++++++-
 mm/list_lru.c                                 |  49 ++++
 mm/migrate.c                                  |  73 +++++-
 mm/migrate_device.c                           |   4 +-
 mm/page_alloc.c                               |   6 +
 mm/thp_utilization.c                          | 222 ++++++++++++++++++
 mm/vmstat.c                                   |   3 +
 .../selftests/vm/split_huge_page_test.c       | 115 ++++++++-
 tools/testing/selftests/vm/vm_util.c          |  23 ++
 tools/testing/selftests/vm/vm_util.h          |   3 +
 17 files changed, 690 insertions(+), 18 deletions(-)
 create mode 100644 mm/thp_utilization.c

--
2.30.2