Re: [PATCH] mm/page_isolation: fix a deadlock with printk()

David Hildenbrand <david@xxxxxxxxxx> · Fri, 4 Oct 2019 19:47:55 +0200

On 04.10.19 18:42, Qian Cai wrote:
> It is unsafe to call printk() while zone->lock was held, i.e.,
> 
> zone->lock --> console_sem
> 
> because the console could always allocate some memory in different code
> paths and form locking chains in an opposite order,
> 
> console_sem --> * --> zone->lock
> 
> As the result, it triggers lockdep splats like below and in [1]. It is
> fine to take zone->lock after has_unmovable_pages() (which has
> dump_stack()) in set_migratetype_isolate(). While at it, remove a
> problematic printk() in __offline_isolated_pages() only for debugging as
> well which will always disable lockdep on debug kernels.
> 
> The problem is probably there forever, but neither many developers will
> run memory offline with the lockdep enabled nor admins in the field are
> lucky enough yet to hit a perfect timing which required to trigger a
> real deadlock. In addition, there aren't many places that call printk()
> while zone->lock was held.
> 
> WARNING: possible circular locking dependency detected
> ------------------------------------------------------
> test.sh/1724 is trying to acquire lock:
> 0000000052059ec0 (console_owner){-...}, at: console_unlock+0x
> 01: 328/0xa30
> 
> but task is already holding lock:
> 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at: start_iso
> 01: late_page_range+0x216/0x538
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 (&(&zone->lock)->rlock){-.-.}:
>        lock_acquire+0x21a/0x468
>        _raw_spin_lock+0x54/0x68
>        get_page_from_freelist+0x8b6/0x2d28
>        __alloc_pages_nodemask+0x246/0x658
>        __get_free_pages+0x34/0x78
>        sclp_init+0x106/0x690
>        sclp_register+0x2e/0x248
>        sclp_rw_init+0x4a/0x70
>        sclp_console_init+0x4a/0x1b8
>        console_init+0x2c8/0x410
>        start_kernel+0x530/0x6a0
>        startup_continue+0x70/0xd0
> 
> -> #1 (sclp_lock){-.-.}:
>        lock_acquire+0x21a/0x468
>        _raw_spin_lock_irqsave+0xcc/0xe8
>        sclp_add_request+0x34/0x308
>        sclp_conbuf_emit+0x100/0x138
>        sclp_console_write+0x96/0x3b8
>        console_unlock+0x6dc/0xa30
>        vprintk_emit+0x184/0x3c8
>        vprintk_default+0x44/0x50
>        printk+0xa8/0xc0
>        iommu_debugfs_setup+0xf2/0x108
>        iommu_init+0x6c/0x78
>        do_one_initcall+0x162/0x680
>        kernel_init_freeable+0x4e8/0x5a8
>        kernel_init+0x2a/0x188
>        ret_from_fork+0x30/0x34
>        kernel_thread_starter+0x0/0xc
> 
> -> #0 (console_owner){-...}:
>        check_noncircular+0x338/0x3e0
>        __lock_acquire+0x1e66/0x2d88
>        lock_acquire+0x21a/0x468
>        console_unlock+0x3a6/0xa30
>        vprintk_emit+0x184/0x3c8
>        vprintk_default+0x44/0x50
>        printk+0xa8/0xc0
>        __dump_page+0x1dc/0x710
>        dump_page+0x2e/0x58
>        has_unmovable_pages+0x2e8/0x470
>        start_isolate_page_range+0x404/0x538
>        __offline_pages+0x22c/0x1338
>        memory_subsys_offline+0xa6/0xe8
>        device_offline+0xe6/0x118
>        state_store+0xf0/0x110
>        kernfs_fop_write+0x1bc/0x270
>        vfs_write+0xce/0x220
>        ksys_write+0xea/0x190
>        system_call+0xd8/0x2b4
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   console_owner --> sclp_lock --> &(&zone->lock)->rlock
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&zone->lock)->rlock);
>                                lock(sclp_lock);
>                                lock(&(&zone->lock)->rlock);
>   lock(console_owner);
> 
>  *** DEADLOCK ***
> 
> 9 locks held by test.sh/1724:
>  #0: 000000000e925408 (sb_writers#4){.+.+}, at: vfs_write+0x201:
>  #1: 0000000050aa4280 (&of->mutex){+.+.}, at: kernfs_fop_write:
>  #2: 0000000062e5c628 (kn->count#198){.+.+}, at: kernfs_fop_write
>  #3: 00000000523236a0 (device_hotplug_lock){+.+.}, at:
> lock_device_hotplug_sysfs+0x30/0x80
>  #4: 0000000062e70990 (&dev->mutex){....}, at: device_offline
>  #5: 0000000051fd36b0 (cpu_hotplug_lock.rw_sem){++++}, at:
> __offline_pages+0xec/0x1338
>  #6: 00000000521ca470 (mem_hotplug_lock.rw_sem){++++}, at:
> percpu_down_write+0x38/0x210
>  #7: 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at:
> start_isolate_page_range+0x216/0x538
>  #8: 000000005205a100 (console_lock){+.+.}, at: vprintk_emit
> 
> stack backtrace:
> Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> Call Trace:
> ([<00000000512ae218>] show_stack+0x110/0x1b0)
>  [<0000000051b6d506>] dump_stack+0x126/0x178
>  [<00000000513a4b08>] check_noncircular+0x338/0x3e0
>  [<00000000513aaaf6>] __lock_acquire+0x1e66/0x2d88
>  [<00000000513a7e12>] lock_acquire+0x21a/0x468
>  [<00000000513bb2fe>] console_unlock+0x3a6/0xa30
>  [<00000000513bde2c>] vprintk_emit+0x184/0x3c8
>  [<00000000513be0b4>] vprintk_default+0x44/0x50
>  [<00000000513beb60>] printk+0xa8/0xc0
>  [<000000005158c364>] __dump_page+0x1dc/0x710
>  [<000000005158c8c6>] dump_page+0x2e/0x58
>  [<00000000515d87c8>] has_unmovable_pages+0x2e8/0x470
>  [<000000005167072c>] start_isolate_page_range+0x404/0x538
>  [<0000000051b96de4>] __offline_pages+0x22c/0x1338
>  [<0000000051908586>] memory_subsys_offline+0xa6/0xe8
>  [<00000000518e561e>] device_offline+0xe6/0x118
>  [<0000000051908170>] state_store+0xf0/0x110
>  [<0000000051796384>] kernfs_fop_write+0x1bc/0x270
>  [<000000005168972e>] vfs_write+0xce/0x220
>  [<0000000051689b9a>] ksys_write+0xea/0x190
>  [<0000000051ba9990>] system_call+0xd8/0x2b4
> INFO: lockdep is turned off.
> 
> [1] https://lore.kernel.org/lkml/1568817579.5576.172.camel@xxxxxx/
> 
> Signed-off-by: Qian Cai <cai@xxxxxx>
> ---
>  mm/page_alloc.c     |  4 ----
>  mm/page_isolation.c | 10 +++++-----
>  2 files changed, 5 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 15c2050c629b..232bbb1dc521 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8588,10 +8588,6 @@ void zone_pcp_reset(struct zone *zone)
>  		BUG_ON(!PageBuddy(page));
>  		order = page_order(page);
>  		offlined_pages += 1 << order;
> -#ifdef CONFIG_DEBUG_VM
> -		pr_info("remove from free list %lx %d %lx\n",
> -			pfn, 1 << order, end_pfn);
> -#endif
>  		del_page_from_free_area(page, &zone->free_area[order]);
>  		for (i = 0; i < (1 << order); i++)
>  			SetPageReserved((page+i));
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 89c19c0feadb..8682ccb5fbd1 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -25,8 +25,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  
>  	zone = page_zone(page);
>  
> -	spin_lock_irqsave(&zone->lock, flags);
> -
>  	/*
>  	 * We assume the caller intended to SET migrate type to isolate.
>  	 * If it is already set, then someone else must have raced and
> @@ -74,16 +72,18 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  		int mt = get_pageblock_migratetype(page);
>  
>  		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> +
> +		spin_lock_irqsave(&zone->lock, flags);

The migratetype has to be tested and set under lock, otherwise two
clients could race. I don't like such severe locking changes just to
make some printing we only need for debugging work.

Can't we somehow return some information (page / cause) from
has_unmovable_pages() and print from a save place instead?

To fix the BUG, I would much rather want to see all printing getting
ripped out instead. That's easy to backort.

We can then come back and think about how to log such stuff in order to
debug it.

-- 

Thanks,

David / dhildenb