On 8/22/19 10:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most important metrics for memory
>> pressure.
>
> I would disagree with this statement. It is a rough estimate that tells
> how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount still might result in
> direct reclaim induced stalls. I do realize that this is a simple metric
> that is attractive to use and works in many cases though.
>
>> Currently, the deferred split THPs are not accounted into available
>> memory, but they are actually reclaimable, like reclaimable slabs.
>>
>> And they seem very common with common workloads when THP is enabled.
>> A simple run of the MariaDB test from mmtest with THP enabled as
>> "always" shows it can generate over fifteen thousand deferred split
>> THPs (accumulating around 30G in a one hour run, 75% of my VM's 40G
>> of memory). It looks worth accounting in MemAvailable.
>
> OK, this makes sense. But your above numbers are really worrying.
> Accumulating such a large amount of pages that are likely not going to
> be used is really bad. They are essentially blocking any higher order
> allocations and also push the system towards more memory pressure.
>
> IIUC deferred splitting is mostly a workaround for nasty locking issues
> during splitting, right? This is not really an optimization to cache
> THPs for reuse or something like that. What is the reason this is not
> done from a worker context? At least THPs which would be freed
> completely sound like a good candidate for kworker tear down, no?

Agreed that it's a good question. For Kirill :) Maybe with the kworker
approach we also wouldn't need the cgroup awareness?

>> Record the number of freeable normal pages of deferred split THPs into
>> the second tail page, and account it into KReclaimable. Although THP
>> allocations are not exactly "kernel allocations", once they are
>> unmapped, they are in fact kernel-only. KReclaimable has been
>> accounted into MemAvailable.
>
> This sounds reasonable to me.
>
>> When the deferred split THPs get split due to memory pressure or get
>> freed, just decrease by the recorded number.
>>
>> With this change, when running a program which populates a 1G address
>> space and then calls madvise(MADV_DONTNEED) on 511 pages of every THP,
>> /proc/meminfo shows the deferred split THPs accounted properly.
>>
>> Populated, before calling madvise(MADV_DONTNEED):
>> MemAvailable: 43531960 kB
>> AnonPages: 1096660 kB
>> KReclaimable: 26156 kB
>> AnonHugePages: 1056768 kB
>>
>> After calling madvise(MADV_DONTNEED):
>> MemAvailable: 44411164 kB
>> AnonPages: 50140 kB
>> KReclaimable: 1070640 kB
>> AnonHugePages: 10240 kB
>>
>> Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>> Cc: David Rientjes <rientjes@xxxxxxxxxx>
>> Signed-off-by: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

...
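BTW, a reproducer along the lines the changelog describes could be as
simple as the sketch below - untested, my own reconstruction rather
than the actual test program, and it assumes 2MB THPs (512 base pages,
so DONTNEEDing all but the first leaves 511 freeable pages per THP):

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define MAP_SIZE	(1UL << 30)	/* 1G address space */
#define THP_SIZE	(2UL << 20)	/* assumed THP size: 2M */
#define BASE_PAGE	4096UL		/* assumed base page size */

int main(void)
{
	unsigned long off;
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	madvise(p, MAP_SIZE, MADV_HUGEPAGE);
	memset(p, 1, MAP_SIZE);		/* populate, hopefully as THPs */

	/* keep the first base page of each THP mapped, drop the rest */
	for (off = 0; off < MAP_SIZE; off += THP_SIZE)
		madvise(p + off + BASE_PAGE, THP_SIZE - BASE_PAGE,
			MADV_DONTNEED);

	pause();	/* inspect /proc/meminfo at this point */
	return 0;
}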
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
>>  
>>  	INIT_LIST_HEAD(page_deferred_list(page));
>>  	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>> +	page[2].nr_freeable = 0;
>>  }
>>  
>>  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
>> @@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  		if (!list_empty(page_deferred_list(head))) {
>>  			ds_queue->split_queue_len--;
>>  			list_del(page_deferred_list(head));
>> +			__mod_node_page_state(page_pgdat(page),
>> +					      NR_KERNEL_MISC_RECLAIMABLE,
>> +					      -head[2].nr_freeable);
>> +			head[2].nr_freeable = 0;
>>  		}
>>  		if (mapping)
>>  			__dec_node_page_state(page, NR_SHMEM_THPS);
>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>  		ds_queue->split_queue_len--;
>>  		list_del(page_deferred_list(page));
>>  	}
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      -page[2].nr_freeable);
>> +	page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the page
to the deferred list? So here the code would be in the
if (!list_empty()) { } part above.

>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  	free_compound_page(page);
>>  }
>>  
>> -void deferred_split_huge_page(struct page *page)
>> +void deferred_split_huge_page(struct page *page, unsigned int nr)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>  #ifdef CONFIG_MEMCG
>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>  		return;
>>  
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +	page[2].nr_freeable += nr;
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      nr);

Same here, only do this when adding to the list, below? Or we might
perhaps account base pages multiple times? (see the sketch after the
diff)

>>  	if (list_empty(page_deferred_list(page))) {
>>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>  		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e5dfe2a..6008fab 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
>>  
>>  	if (nr) {
>>  		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
>> -		deferred_split_huge_page(page);
>> +		deferred_split_huge_page(page, nr);
>>  	}
>>  }
>>  
>> @@ -1320,7 +1320,7 @@ void page_remove_rmap(struct page *page, bool compound)
>>  		clear_page_mlock(page);
>>  
>>  	if (PageTransCompound(page))
>> -		deferred_split_huge_page(compound_head(page));
>> +		deferred_split_huge_page(compound_head(page), 1);
>>  
>>  	/*
>>  	 * It would be tidy to reset the PageAnon mapping here,
>> -- 
>> 1.8.3.1
>
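To illustrate what I mean, with the accounting tied to the list
insertion the function could look roughly like below. Just an untested
sketch, with the existing early-return and memcg handling before the
lock elided:

void deferred_split_huge_page(struct page *page, unsigned int nr)
{
	struct deferred_split *ds_queue = get_deferred_split_queue(page);
	unsigned long flags;

	/* ... existing checks and memcg handling before the lock ... */

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (list_empty(page_deferred_list(page))) {
		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
		ds_queue->split_queue_len++;
		/*
		 * Account the freeable base pages only when the THP is
		 * queued for the first time, so repeated partial unmaps
		 * of a page already on the list cannot inflate the
		 * NR_KERNEL_MISC_RECLAIMABLE counter.
		 */
		page[2].nr_freeable += nr;
		__mod_node_page_state(page_pgdat(page),
				      NR_KERNEL_MISC_RECLAIMABLE, nr);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}

Then free_transhuge_page() and split_huge_page_to_list() would subtract
nr_freeable only in their if (!list_empty()) branches, so the counter
is always adjusted under the same condition as the list membership.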