Re: [RFC] mm: introduce page pinner

胡玮文 <huww98@xxxxxxxxxxx> · Fri, 10 Dec 2021 17:54:24 +0800

On Wed, Dec 08, 2021 at 10:42:37AM -0800, Minchan Kim wrote:
> On Wed, Dec 08, 2021 at 07:54:35PM +0800, 胡玮文 wrote:
> > On Mon, Dec 06, 2021 at 10:47:30AM -0800, Minchan Kim wrote:
> > > The contiguous memory allocation fails if one of the pages in
> > > requested range has unexpected elevated reference count since
> > > VM couldn't migrate the page out. It's very common pattern for
> > > CMA allocation failure. The temporal elevated page refcount
> > > could happen from various places and it's really hard to chase
> > > who held the temporal page refcount at that time, which is the
> > > vital information to debug the allocation failure.
> 
> Hi,
> 
> Please don't cut down original Cc list without special reason.

Sorry, my school SMTP server does not allow that much recipients. I haved
changed to outlook.

> > Hi Minchan,
> > 
> > I'm a newbie here. We are debugging a problem where every CPU core is doing
> > compaction but making no progress, because of the unexpected page refcount. I'm
> > interested in your approach, but this patch seems only to cover the CMA
> > allocation path. So could it be extended to debugging migrate failure during
> > compaction?  I'm not familiar with the kernel codebase, here is my untested
> > thought:
> 
> The compaction failure will produce a lot events I wanted to avoid
> in my system but I think your case is reasonable if you doesn't
> mind the large events.
> 
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index cf25b00f03c8..85dacbca8fa0 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -46,6 +46,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/page_idle.h>
> >  #include <linux/page_owner.h>
> > +#include <linux/page_pinner.h>
> >  #include <linux/sched/mm.h>
> >  #include <linux/ptrace.h>
> >  #include <linux/oom.h>
> > @@ -388,8 +389,10 @@ int folio_migrate_mapping(struct address_space *mapping,
> >  
> >         if (!mapping) {
> >                 /* Anonymous page without mapping */
> > -               if (folio_ref_count(folio) != expected_count)
> > +               if (folio_ref_count(folio) != expected_count) {
> > +                       page_pinner_failure(&folio->page);
> >                         return -EAGAIN;
> > +               }
> >  
> >                 /* No turning back from here */
> >                 newfolio->index = folio->index;
> > @@ -406,6 +409,7 @@ int folio_migrate_mapping(struct address_space *mapping,
> >         xas_lock_irq(&xas);
> >         if (!folio_ref_freeze(folio, expected_count)) {
> >                 xas_unlock_irq(&xas);
> > +               page_pinner_failure(&folio->page);
> >                 return -EAGAIN;
> >         }
> > 
> > I'm not sure what to do with the new folio, it seems using folio->page in new
> > codes is not correct.

I tested the above proposed patch, it works in my case. But it produces a lot of
redundant page_pinner_put events. Before the true pinner reveals, the traced
pages are get and put multiple times. Besides, when passed to
page_pinner_failure(), the "count" is 3 in my case, any of the 3 holders could
be the interested pinner. I think this is hard to avoid, and we can just let the
users distinguish which is the interested pinner. Maybe we need some docs about
this.

> If you want to cover compaction only, maybe this one:
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index bfc93da1c2c7..7bfbf7205fb8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2400,6 +2400,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>                 /* All pages were either migrated or will be released */
>                 cc->nr_migratepages = 0;
>                 if (err) {
> +                       struct page *failed_page;
> +
> +                       list_for_each_entry(failed_page, &cc->migratepages, lru)
> +                               page_pinner_failure(failed_page);
> +
>                         putback_movable_pages(&cc->migratepages);
>                         /*
>                          * migrate_pages() may return -ENOMEM when scanners meet

Maybe we should put the page_pinner_failure() calls as close to the real
refcount check as possible to avoid protential racing and loss some
page_pinner_put events? Besides, migrate_pages() will retry for 10 times, and I
image that someone may want to find out who is causing the retry. And migration
may fail for a number of reason, not only unexpected refcount.

I image that enabling page pinner for migration senarios other than compaction
could be helpful for others.

> However, for the case, I want to introduce some filter options like
> failure reason(?)
> 
>     page_pinner_failure(pfn, reason)
> 
> So,  I could keep getting only CMA allocation failure events, not
> compaction failure.

This is a good idea to me. But how can we implement the filter? Can we reuse the
trace event filter? i.e., if the page_pinner_failure event is filtered out, then
we don't set the PAGE_EXT_PINNER flag and effectively also filter the
corresponding page_pinner_put event out. I can't see whether it is possible now.
trace_page_pinner_failure() returns void, so it seems we cannot know whether the
event got through.

If this is not possible, we may need to allocate additional space to store the
reason for each traced page, and also pass the reason to trace_page_pinner_put().

> > > This patch introduces page pinner to keep track of Page Pinner
> > > who caused the CMA allocation failure. How page pinner work is
> > > once VM found the non-migrated page after trying migration
> > > during contiguos allocation, it marks the page and every page-put
> > > operation on the page since then will have event trace. Since
> > > page-put is always with page-get, the page-put event trace helps
> > > to deduce where the pair page-get originated from.
> > > 
> > > The reason why the feature tracks page-put instead of page-get
> > > indirectly is that since VM couldn't expect when page migration
> > > fails, it should keep track of every page-get for migratable page
> > > to dump information at failure. Considering backtrace as vitial
> > > information as well as page's get/put is one of hottest path,
> > > it's too heavy approach. Thus, to minimize runtime overhead,
> > > this feature adds a new PAGE_EXT_PINNER flag under PAGE_EXT
> > > debugging option to indicate migration-failed page and only
> > > tracks every page-put operation for the page since the failure.
> > > 
> > > usage:
> > > 
> > > trace_dir="/sys/kernel/tracing"
> > > echo 1 > $trace_dir/events/page_pinner/enable
> > > echo 1 > $trace_dir/options/stacktrace
> > > ..
> > > run workload
> > > ..
> > > ..
> > > 
> > > cat $trace_dir/trace
> > > 
> > >            <...>-498     [006] .... 33306.301621: page_pinner_failure: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=1 mapcount=0 mapping=00000000aec7812a mt=5
> > >            <...>-498     [006] .... 33306.301625: <stack trace>
> > >  => __page_pinner_failure
> > >  => test_pages_isolated
> > >  => alloc_contig_range
> > >  => cma_alloc
> > >  => cma_heap_allocate
> > >  => dma_heap_ioctl
> > >  => __arm64_sys_ioctl
> > >  => el0_svc_common
> > >  => do_el0_svc
> > >  => el0_svc
> > >  => el0_sync_handler
> > >  => el0_sync
> > >            <...>-24965   [001] .... 33306.392836: page_pinner_put: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=00000000aec7812a mt=5
> > >            <...>-24965   [001] .... 33306.392846: <stack trace>
> > >  => __page_pinner_put
> > >  => release_pages
> > >  => free_pages_and_swap_cache
> > >  => tlb_flush_mmu_free
> > >  => tlb_flush_mmu
> > >  => zap_pte_range
> > >  => unmap_page_range
> > >  => unmap_vmas
> > >  => exit_mmap
> > >  => __mmput
> > >  => mmput
> > >  => exit_mm
> > >  => do_exit
> > >  => do_group_exit
> > >  => get_signal
> > >  => do_signal
> > >  => do_notify_resume
> > >  => work_pending
> > > 
> > > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > > ---
> > > The PagePinner named after PageOwner since I wanted to keep track of
> > > page refcount holder. Feel free to suggest better names.
> > > Actually, I had alloc_contig_failure tracker as a candidate.
> > > 
> > >  include/linux/mm.h                 |  7 ++-
> > >  include/linux/page_ext.h           |  3 +
> > >  include/linux/page_pinner.h        | 47 ++++++++++++++++
> > >  include/trace/events/page_pinner.h | 60 ++++++++++++++++++++
> > >  mm/Kconfig.debug                   | 13 +++++
> > >  mm/Makefile                        |  1 +
> > >  mm/page_alloc.c                    |  3 +
> > >  mm/page_ext.c                      |  4 ++
> > >  mm/page_isolation.c                |  3 +
> > >  mm/page_pinner.c                   | 90 ++++++++++++++++++++++++++++++
> > >  10 files changed, 230 insertions(+), 1 deletion(-)
> > >  create mode 100644 include/linux/page_pinner.h
> > >  create mode 100644 include/trace/events/page_pinner.h
> > >  create mode 100644 mm/page_pinner.c
> 
> < snip>
> 
> > > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > > index 1e73717802f8..0ad4a3b8f4eb 100644
> > > --- a/mm/Kconfig.debug
> > > +++ b/mm/Kconfig.debug
> > > @@ -62,6 +62,19 @@ config PAGE_OWNER
> > >  
> > >  	  If unsure, say N.
> > >  
> > > +config PAGE_PINNER
> > > +	bool "Track page pinner"
> > > +	select PAGE_EXTENSION
> > > +	depends on DEBUG_KERNEL && TRACEPOINTS
> > > +	help
> > > +	  This keeps track of what call chain is the pinner of a page, may
> > > +	  help to find contiguos page allocation failure. Even if you include
> > > +	  this feature in your build, it is disabled by default. You should
> > > +	  pass "page_pinner=on" to boot parameter in order to enable it. Eats
> > > +	  a fair amount of memory if enabled.
> > 
> > I'm a bit confused. It seems page pinner does not allocate any additional
> > memory if you enable it by boot parameter. So the description seems inaccurate.
> 
> It will allocate page_ext descriptors so consumes the memory.

Thanks, I see. So it is 8 bytes for each 4k page. Not much I think.

> > > +
> > > +	  If unsure, say N.
> > > +
> > >  config PAGE_POISONING
> > >  	bool "Poison pages after freeing"
> > >  	help
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index fc60a40ce954..0c9b78b15070 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -102,6 +102,7 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> > >  obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
> > >  obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
> > >  obj-$(CONFIG_PAGE_OWNER) += page_owner.o
> > > +obj-$(CONFIG_PAGE_PINNER) += page_pinner.o
> > >  obj-$(CONFIG_CLEANCACHE) += cleancache.o
> > >  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
> > >  obj-$(CONFIG_ZPOOL)	+= zpool.o
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index f41a5e990ac0..6e3a6f875a40 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -63,6 +63,7 @@
> > >  #include <linux/sched/rt.h>
> > >  #include <linux/sched/mm.h>
> > >  #include <linux/page_owner.h>
> > > +#include <linux/page_pinner.h>
> > >  #include <linux/kthread.h>
> > >  #include <linux/memcontrol.h>
> > >  #include <linux/ftrace.h>
> > > @@ -1299,6 +1300,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> > >  		if (memcg_kmem_enabled() && PageMemcgKmem(page))
> > >  			__memcg_kmem_uncharge_page(page, order);
> > >  		reset_page_owner(page, order);
> > > +		reset_page_pinner(page, order);
> > >  		return false;
> > >  	}
> > >  
> > > @@ -1338,6 +1340,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> > >  	page_cpupid_reset_last(page);
> > >  	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > >  	reset_page_owner(page, order);
> > > +	reset_page_pinner(page, order);
> > >  
> > >  	if (!PageHighMem(page)) {
> > >  		debug_check_no_locks_freed(page_address(page),
> > > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > > index 2a52fd9ed464..0dafe968b212 100644
> > > --- a/mm/page_ext.c
> > > +++ b/mm/page_ext.c
> > > @@ -8,6 +8,7 @@
> > >  #include <linux/kmemleak.h>
> > >  #include <linux/page_owner.h>
> > >  #include <linux/page_idle.h>
> > > +#include <linux/page_pinner.h>
> > >  
> > >  /*
> > >   * struct page extension
> > > @@ -75,6 +76,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> > >  #if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
> > >  	&page_idle_ops,
> > >  #endif
> > > +#ifdef CONFIG_PAGE_PINNER
> > > +	&page_pinner_ops,
> > > +#endif
> > >  };
> > >  
> > >  unsigned long page_ext_size = sizeof(struct page_ext);
> > > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > > index a95c2c6562d0..a9ddea1c9166 100644
> > > --- a/mm/page_isolation.c
> > > +++ b/mm/page_isolation.c
> > > @@ -9,6 +9,7 @@
> > >  #include <linux/memory.h>
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/page_owner.h>
> > > +#include <linux/page_pinner.h>
> > >  #include <linux/migrate.h>
> > >  #include "internal.h"
> > >  
> > > @@ -310,6 +311,8 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> > >  
> > >  out:
> > >  	trace_test_pages_isolated(start_pfn, end_pfn, pfn);
> > > +	if (ret < 0)
> > > +		page_pinner_failure(pfn_to_page(pfn));
> > >  
> > >  	return ret;
> > >  }
> > > diff --git a/mm/page_pinner.c b/mm/page_pinner.c
> > > new file mode 100644
> > > index 000000000000..300a90647557
> > > --- /dev/null
> > > +++ b/mm/page_pinner.c
> > > @@ -0,0 +1,90 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include <linux/mm.h>
> > > +#include <linux/page_pinner.h>
> > > +
> > > +#define CREATE_TRACE_POINTS
> > > +#include <trace/events/page_pinner.h>
> > > +
> > > +static bool page_pinner_enabled;
> > > +DEFINE_STATIC_KEY_FALSE(page_pinner_inited);
> > > +EXPORT_SYMBOL(page_pinner_inited);
> > > +
> > > +static int __init early_page_pinner_param(char *buf)
> > > +{
> > > +	return kstrtobool(buf, &page_pinner_enabled);
> > > +}
> > > +early_param("page_pinner", early_page_pinner_param);
> > > +
> > > +static bool need_page_pinner(void)
> > > +{
> > > +	return page_pinner_enabled;
> > > +}
> > > +
> > > +static void init_page_pinner(void)
> > > +{
> > > +	if (!page_pinner_enabled)
> > > +		return;
> > > +
> > > +	static_branch_enable(&page_pinner_inited);
> > > +}
> > > +
> > > +struct page_ext_operations page_pinner_ops = {
> > > +	.need = need_page_pinner,
> > > +	.init = init_page_pinner,
> > > +};
> > > +
> > > +void __reset_page_pinner(struct page *page, unsigned int order)
> > > +{
> > > +	struct page_ext *page_ext;
> > > +	int i;
> > > +
> > > +	page_ext = lookup_page_ext(page);
> > > +	if (unlikely(!page_ext))
> > > +		return;
> > > +
> > > +	for (i = 0; i < (1 << order); i++) {
> > > +		if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
> > > +			break;
> > > +
> > > +		clear_bit(PAGE_EXT_PINNER, &page_ext->flags);
> > > +		page_ext = page_ext_next(page_ext);
> > > +	}
> > > +}
> > > +
> > > +void __page_pinner_failure(struct page *page)
> > > +{
> > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > +
> > > +	if (unlikely(!page_ext))
> > > +		return;
> > > +
> > > +	trace_page_pinner_failure(page);
> > > +	test_and_set_bit(PAGE_EXT_PINNER, &page_ext->flags);
> > > +}
> > > +
> > > +void __page_pinner_put(struct page *page)
> > > +{
> > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > +
> > > +	if (unlikely(!page_ext))
> > > +		return;
> > > +
> > > +	if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
> > > +		return;
> > > +
> > > +	trace_page_pinner_put(page);
> > > +}
> > > +EXPORT_SYMBOL(__page_pinner_put);
> > > +
> > > +
> > > +static int __init page_pinner_init(void)
> > > +{
> > > +	if (!static_branch_unlikely(&page_pinner_inited)) {
> > > +		pr_info("page_pinner is disabled\n");
> > > +		return 0;
> > > +	}
> > > +
> > > +	pr_info("page_pinner is enabled\n");
> > > +	return 0;
> > > +}
> > > +late_initcall(page_pinner_init)
> > > -- 
> > > 2.34.1.400.ga245620fadb-goog
> > > 
> > 
> > More info about my compaction issue:
> > 
> > This call stack returns -EAGAIN in 99.9% cases on the problematic host
> > (Ubuntu 20.04 with kernel 5.11.0-40):
> > 
> > migrate_page_move_mapping (now folio_migrate_mapping) <- returns -EAGAIN
> > migrate_page
> > fallback_migrate_page
> > move_to_new_page
> > migrate_pages
> > compact_zone
> > compact_zone_order
> > try_to_compact_pages
> > __alloc_pages_direct_compact
> > __alloc_pages_slowpath.constprop.0
> > __alloc_pages_nodemask
> > alloc_pages_vma
> > do_huge_pmd_anonymous_page
> > __handle_mm_fault
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > 
> > The offending pages are from shm, allocated by mmap() with MAP_SHARED by a
> > machine learning program. They may have relationships with NVIDIA CUDA, but I
> > want to confirm this, and make improvements if possible.
> 
> So you are suspecting some kernel driver hold a addtional refcount
> using get_user_pages or page get API?

Yes. By using the trace events in this patch, I have confirmed it is nvidia
kernel module that holds the refcount. I got the stacktrace like this (from
"perf script"):

cuda-EvtHandlr 31023 [000]  3244.976411:                   page_pinner:page_pinner_put: pfn=0x13e473 flags=0x8001e count=0 mapcount=0 mapping=(nil) mt=1
        ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
        ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
        ffffffffc0b71e1f os_unlock_user_pages+0xbf ([nvidia])
        ffffffffc14a4546 _nv032165rm+0x96 ([nvidia])

Still not much information. NVIDIA does not want me to debug its module. Maybe
the only thing I can do is reporting this to NVIDIA.

> > When the issue reproduce, a single page fault that triggers a sync compaction
> > can take tens of seconds. Then all 40 CPU threads are doing compaction, and
> > application runs several order of magnitude slower.
> > 
> > Disabling sync compaction is a workaround (the default is "madvise"):
> > 
> > echo never > /sys/kernel/mm/transparent_hugepage/defrag
> > 
> > Previously I asked for help at https://lore.kernel.org/linux-mm/20210516085644.13800-1-hdanton@xxxxxxxx/
> > Now I have more information but still cannot pinpoint the root cause.
> > 
> > Thanks,
> > Hu Weiwen
> > 
> >