The patch titled
     Subject: mm, thp: account deferred split THPs into MemAvailable
has been added to the -mm tree.  Its filename is
     mm-account-deferred-split-thps-into-memavailable.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-account-deferred-split-thps-into-memavailable.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-account-deferred-split-thps-into-memavailable.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
Subject: mm, thp: account deferred split THPs into MemAvailable

Available memory is one of the most important metrics for memory pressure.
Currently the deferred split THPs are not accounted into available memory,
but they are actually reclaimable, like reclaimable slabs.

They also appear to be very common with typical workloads when THP is
enabled.  A simple run of the MariaDB test of mmtest with THP set to
"always" shows that it can generate over fifteen thousand deferred split
THPs (accumulating around 30G in a one-hour run, 75% of the 40G memory of
my VM).  This looks worth accounting in MemAvailable.

Record the number of freeable normal pages of deferred split THPs into the
second tail page, and account it into KReclaimable.  Although THP
allocations are not exactly "kernel allocations", once they are unmapped,
they are in fact kernel-only.  KReclaimable is already accounted into
MemAvailable.

When the deferred split THPs get split due to memory pressure or freed,
just decrease the counter by the recorded number.

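
The bookkeeping described above can be modelled in plain userspace C.  This
is only an illustrative sketch, not kernel code: struct toy_thp,
toy_kreclaimable, toy_deferred_split() and toy_split_or_free() are
hypothetical names standing in for page[2].nr_freeable, the
NR_KERNEL_MISC_RECLAIMABLE node counter, deferred_split_huge_page() and the
split/free paths, and locking is left out entirely.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-THP state the patch adds. */
struct toy_thp {
	unsigned long nr_freeable;	/* mirrors page[2].nr_freeable */
};

/* Hypothetical stand-in for NR_KERNEL_MISC_RECLAIMABLE, counted in pages. */
static long toy_kreclaimable;

/* nr more subpages of this THP were unmapped: record and account them. */
static void toy_deferred_split(struct toy_thp *thp, unsigned int nr)
{
	thp->nr_freeable += nr;
	toy_kreclaimable += nr;
}

/* The THP is split or freed: give back exactly what was recorded. */
static void toy_split_or_free(struct toy_thp *thp)
{
	assert(toy_kreclaimable >= (long)thp->nr_freeable);
	toy_kreclaimable -= (long)thp->nr_freeable;
	thp->nr_freeable = 0;
}
```

Because the per-THP count is kept next to the global counter, the split and
free paths can subtract the recorded value without rescanning the subpages.
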
With this change, when running a program which populates a 1G address space
and then calls madvise(MADV_DONTNEED) on 511 pages of every THP,
/proc/meminfo shows that the deferred split THPs are accounted properly.

After populating, before calling madvise(MADV_DONTNEED):
MemAvailable:   43531960 kB
AnonPages:       1096660 kB
KReclaimable:      26156 kB
AnonHugePages:   1056768 kB

After calling madvise(MADV_DONTNEED):
MemAvailable:   44411164 kB
AnonPages:         50140 kB
KReclaimable:    1070640 kB
AnonHugePages:     10240 kB

Link: http://lkml.kernel.org/r/1566410125-66011-1-git-send-email-yang.shi@xxxxxxxxxxxxxxxxx
Signed-off-by: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 Documentation/filesystems/proc.txt |    4 ++--
 include/linux/huge_mm.h            |    7 +++++--
 include/linux/mm_types.h           |    3 ++-
 mm/huge_memory.c                   |   13 ++++++++++++-
 mm/rmap.c                          |    4 ++--
 5 files changed, 23 insertions(+), 8 deletions(-)

--- a/Documentation/filesystems/proc.txt~mm-account-deferred-split-thps-into-memavailable
+++ a/Documentation/filesystems/proc.txt
@@ -968,8 +968,8 @@ ShmemHugePages: Memory used by shared me
               with huge pages
 ShmemPmdMapped: Shared memory mapped into userspace with huge pages
 KReclaimable: Kernel allocations that the kernel will attempt to reclaim
-              under memory pressure. Includes SReclaimable (below), and other
-              direct allocations with a shrinker.
+              under memory pressure. Includes SReclaimable (below), deferred
+              split THPs, and other direct allocations with a shrinker.
         Slab: in-kernel data structures cache
 SReclaimable: Part of Slab, that might be reclaimed, such as caches
   SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
--- a/include/linux/huge_mm.h~mm-account-deferred-split-thps-into-memavailable
+++ a/include/linux/huge_mm.h
@@ -162,7 +162,7 @@ static inline int split_huge_page(struct
 {
 	return split_huge_page_to_list(page, NULL);
 }
-void deferred_split_huge_page(struct page *page);
+void deferred_split_huge_page(struct page *page, unsigned int nr);
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze, struct page *page);
@@ -324,7 +324,10 @@ static inline int split_huge_page(struct
 {
 	return 0;
 }
-static inline void deferred_split_huge_page(struct page *page) {}
+static inline void deferred_split_huge_page(struct page *page, unsigned int nr)
+{
+}
+
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 
--- a/include/linux/mm_types.h~mm-account-deferred-split-thps-into-memavailable
+++ a/include/linux/mm_types.h
@@ -138,7 +138,8 @@ struct page {
 		};
 		struct {	/* Second tail page of compound page */
 			unsigned long _compound_pad_1;	/* compound_head */
-			unsigned long _compound_pad_2;
+			/* Freeable normal pages for deferred split shrinker */
+			unsigned long nr_freeable;
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
--- a/mm/huge_memory.c~mm-account-deferred-split-thps-into-memavailable
+++ a/mm/huge_memory.c
@@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *pa
 
 	INIT_LIST_HEAD(page_deferred_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+	page[2].nr_freeable = 0;
 }
 
 static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
@@ -2795,6 +2796,10 @@ int split_huge_page_to_list(struct page
 		if (!list_empty(page_deferred_list(head))) {
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
+			__mod_node_page_state(page_pgdat(page),
+					NR_KERNEL_MISC_RECLAIMABLE,
+					-head[2].nr_freeable);
+			head[2].nr_freeable = 0;
 		}
 		if (mapping)
 			__dec_node_page_state(page, NR_SHMEM_THPS);
@@ -2845,11 +2850,14 @@ void free_transhuge_page(struct page *pa
 		ds_queue->split_queue_len--;
 		list_del(page_deferred_list(page));
 	}
+	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
+			      -page[2].nr_freeable);
+	page[2].nr_freeable = 0;
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	free_compound_page(page);
 }
 
-void deferred_split_huge_page(struct page *page)
+void deferred_split_huge_page(struct page *page, unsigned int nr)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
 #ifdef CONFIG_MEMCG
@@ -2873,6 +2881,9 @@ void deferred_split_huge_page(struct pag
 		return;
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	page[2].nr_freeable += nr;
+	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
+			      nr);
 	if (list_empty(page_deferred_list(page))) {
 		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
 		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
--- a/mm/rmap.c~mm-account-deferred-split-thps-into-memavailable
+++ a/mm/rmap.c
@@ -1287,7 +1287,7 @@ static void page_remove_anon_compound_rm
 
 	if (nr) {
 		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
-		deferred_split_huge_page(page);
+		deferred_split_huge_page(page, nr);
 	}
 }
 
@@ -1321,7 +1321,7 @@ void page_remove_rmap(struct page *page,
 		clear_page_mlock(page);
 
 	if (PageTransCompound(page))
-		deferred_split_huge_page(compound_head(page));
+		deferred_split_huge_page(compound_head(page), 1);
 
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
_

Patches currently in -mm which might be from yang.shi@xxxxxxxxxxxxxxxxx are

mm-thp-extract-split_queue_-into-a-struct.patch
mm-move-mem_cgroup_uncharge-out-of-__page_cache_release.patch
mm-shrinker-make-shrinker-not-depend-on-memcg-kmem.patch
mm-thp-make-deferred-split-shrinker-memcg-aware.patch
mm-account-deferred-split-thps-into-memavailable.patch
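
The test program mentioned in the changelog (populate an address range,
then madvise(MADV_DONTNEED) all but one base page of every THP) is not part
of this patch.  A minimal userspace sketch of such a test might look like
the following.  Everything in it is an assumption made for illustration: the
helper name populate_and_discard(), the 2M THP size, and the choice to keep
the first base page of each chunk.  Whether the range is actually backed by
THPs depends on the system's THP configuration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SIZE (2UL << 20)	/* assumed PMD-sized THP: 2M */

/*
 * Populate len bytes of THP-aligned anonymous memory, then discard all but
 * the first base page of every THP-sized chunk.  Returns the number of base
 * pages handed to madvise(MADV_DONTNEED), or -1 on error.  With keep != 0
 * the mapping is left in place so /proc/meminfo can be inspected.
 */
static long populate_and_discard(size_t len, int keep)
{
	long page = sysconf(_SC_PAGESIZE);
	long discarded = 0;
	char *raw, *buf;

	if (page <= 0 || (size_t)page >= THP_SIZE)
		return -1;

	/* Over-allocate by one THP so the start can be aligned up. */
	raw = mmap(NULL, len + THP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;
	buf = (char *)(((uintptr_t)raw + THP_SIZE - 1) & ~(THP_SIZE - 1));

#ifdef MADV_HUGEPAGE
	madvise(buf, len, MADV_HUGEPAGE);	/* a hint only; may fail */
#endif
	memset(buf, 1, len);			/* fault every page in */

	for (size_t off = 0; off + THP_SIZE <= len; off += THP_SIZE) {
		if (madvise(buf + off + page, THP_SIZE - page, MADV_DONTNEED)) {
			discarded = -1;
			break;
		}
		discarded += (long)(THP_SIZE / page) - 1;
	}

	if (!keep || discarded < 0)
		munmap(raw, len + THP_SIZE);
	return discarded;
}
```

Calling populate_and_discard(1UL << 30, 1) and then reading /proc/meminfo
would correspond to the "after" numbers in the changelog; with 4K base pages
each 2M chunk contributes 511 discarded pages.
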