Re: [PATCH] compact_pgdat: workaround lockdep warning in kswapd

Mel Gorman <mel@xxxxxxxxx> · Thu, 9 Feb 2012 15:40:14 +0000

On Mon, Feb 06, 2012 at 11:40:08AM -0800, Hugh Dickins wrote:
> I get this lockdep warning from swapping load on linux-next
> (20120201 but I expect the same from more recent days):
> 

Only getting to this now. Yes, I'm slow.

> =================================
> [ INFO: inconsistent lock state ]
> 3.3.0-rc2-next-20120201 #5 Not tainted
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

Ok, shoving that through my spidey decoder ring, that is saying we were
in reclaim context (IN-RECLAIM_FS-W) and taking a mutex that in
the past was taken from a different reclaim context. In lockdep's mind,
this leads to a potential deadlock where a user of pcpu_alloc needs
kswapd or a reclaimer to make forward progress, which it can't because
it depends on the same mutex.
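
For what it is worth, here is a toy userspace model of that check, as far
as the report lets me reconstruct it. It is not lockdep's code. It assumes
lockdep keeps a set of usage bits per lock class and complains once a
class has been seen both held across an allocation that may enter reclaim
(the "state was registered at" trace) and taken from reclaim context
(kswapd here). Every name in it is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical usage bits for one lock class; not lockdep's real layout. */
enum usage_bit {
	HELD_OVER_RECLAIM_ALLOC = 1 << 0, /* reported as RECLAIM_FS-ON-W */
	TAKEN_IN_RECLAIM        = 1 << 1, /* reported as IN-RECLAIM_FS-W */
};

struct lock_class_model {
	const char *name;
	unsigned int usage;	/* OR of every usage_bit seen so far */
};

/* Record one usage; return false once both bits are set on the class. */
bool model_mark_usage(struct lock_class_model *class, enum usage_bit bit)
{
	const unsigned int both = HELD_OVER_RECLAIM_ALLOC | TAKEN_IN_RECLAIM;

	class->usage |= bit;
	return (class->usage & both) != both;
}

/* Replay the two events from the report in the order they happened. */
bool model_replay_report(void)
{
	struct lock_class_model pcpu = { "pcpu_alloc_mutex", 0 };

	/* Boot: __kmalloc() under the mutex, via pcpu_extend_area_map(). */
	bool boot_ok = model_mark_usage(&pcpu, HELD_OVER_RECLAIM_ALLOC);
	/* Later: kswapd takes the same mutex via schedule_on_each_cpu(). */
	bool kswapd_ok = model_mark_usage(&pcpu, TAKEN_IN_RECLAIM);

	assert(boot_ok);
	return kswapd_ok;	/* false: this is where the model complains */
}
```

Note the model flags the combination without any deadlock having
happened, which matches the "inconsistent lock state" wording: the two
usages were seen at different times, from boot and from kswapd.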

> kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  (pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
> {RECLAIM_FS-ON-W} state was registered at:
>   [<ffffffff81099b75>] mark_held_locks+0xd7/0x103
>   [<ffffffff8109a13c>] lockdep_trace_alloc+0x85/0x9e
>   [<ffffffff810f6bdc>] __kmalloc+0x6c/0x14b
>   [<ffffffff810d57fd>] pcpu_mem_zalloc+0x59/0x62
>   [<ffffffff810d5d16>] pcpu_extend_area_map+0x26/0xb1
>   [<ffffffff810d679f>] pcpu_alloc+0x182/0x325
>   [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
>   [<ffffffff8142ebfd>] snmp_mib_init+0x1e/0x2e
>   [<ffffffff8185cd8d>] ipv4_mib_init_net+0x7a/0x184
>   [<ffffffff813dc963>] ops_init.clone.0+0x6b/0x73
>   [<ffffffff813dc9cc>] register_pernet_operations+0x61/0xa0
>   [<ffffffff813dca8e>] register_pernet_subsys+0x29/0x42
>   [<ffffffff8185d044>] inet_init+0x1ad/0x252
>   [<ffffffff810002e3>] do_one_initcall+0x7a/0x12f
>   [<ffffffff81832bc5>] kernel_init+0x9d/0x11e
>   [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
> irq event stamp: 656613
> hardirqs last  enabled at (656613): [<ffffffff814e0ddc>] __mutex_unlock_slowpath+0x104/0x128
> hardirqs last disabled at (656612): [<ffffffff814e0d34>] __mutex_unlock_slowpath+0x5c/0x128
> softirqs last  enabled at (655568): [<ffffffff8105b4a5>] __do_softirq+0x120/0x136
> softirqs last disabled at (654757): [<ffffffff814e52dc>] call_softirq+0x1c/0x30
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(pcpu_alloc_mutex);
>   <Interrupt>
>     lock(pcpu_alloc_mutex);
> 
>  *** DEADLOCK ***
> 
> no locks held by kswapd0/28.
> 
> stack backtrace:
> Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
> Call Trace:
>  [<ffffffff810981f4>] print_usage_bug+0x1bf/0x1d0
>  [<ffffffff81096c3e>] ? print_irq_inversion_bug+0x1d9/0x1d9
>  [<ffffffff810982c0>] mark_lock_irq+0xbb/0x22e
>  [<ffffffff810c5399>] ? free_hot_cold_page+0x13d/0x14f
>  [<ffffffff81098684>] mark_lock+0x251/0x331
>  [<ffffffff81098893>] mark_irqflags+0x12f/0x141
>  [<ffffffff81098e32>] __lock_acquire+0x58d/0x753
>  [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
>  [<ffffffff81099433>] lock_acquire+0x54/0x6a
>  [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
>  [<ffffffff8107a5b8>] ? add_preempt_count+0xa9/0xae
>  [<ffffffff814e0a21>] mutex_lock_nested+0x5e/0x315
>  [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
>  [<ffffffff81098f81>] ? __lock_acquire+0x6dc/0x753
>  [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
>  [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
>  [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
>  [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
>  [<ffffffff8106c35e>] schedule_on_each_cpu+0x23/0x110
>  [<ffffffff810c9fcb>] lru_add_drain_all+0x10/0x12
>  [<ffffffff810f126f>] __compact_pgdat+0x20/0x182
>  [<ffffffff810f15c2>] compact_pgdat+0x27/0x29
>  [<ffffffff810c306b>] ? zone_watermark_ok+0x1a/0x1c
>  [<ffffffff810cdf6f>] balance_pgdat+0x732/0x751
>  [<ffffffff810ce0ed>] kswapd+0x15f/0x178
>  [<ffffffff810cdf8e>] ? balance_pgdat+0x751/0x751
>  [<ffffffff8106fd11>] kthread+0x84/0x8c
>  [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
>  [<ffffffff810787ed>] ? finish_task_switch+0x85/0xea
>  [<ffffffff814e3861>] ? retint_restore_args+0xe/0xe
>  [<ffffffff8106fc8d>] ? __init_kthread_worker+0x56/0x56
>  [<ffffffff814e51e0>] ? gs_change+0xb/0xb
> 
> The RECLAIM_FS notations indicate that it's doing the GFP_FS checking
> that Nick hacked into lockdep a while back: I think we're intended to
> read that "<Interrupt>" in the DEADLOCK scenario as "<Direct reclaim>".
> 

It's not GFP_FS it is complaining about though. It's complaining because
that mutex is being taken from inconsistent reclaim contexts. At least,
that is my reading of it. It's not often I read lockdep reports so I
could be wrong.

> I'm hazy, I have not reached any conclusion as to whether it's right
> to complain or not; but I believe it's uneasy about kswapd now doing
> the mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails.
> Nor have I reached any conclusion as to whether it's important for
> kswapd to do that draining or not.
> 

It's not important for kswapd to do this draining. Compaction via proc
does the draining to maximise the amount of compaction it is able to do.
kswapd is best effort and not even doing sync compaction, let alone
caring about draining pagevecs.
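
The call-path split I have in mind can be sketched as a userspace toy.
This is not the patch and not kernel code: the model_* functions are
stand-ins I made up, and the only thing it shows is where the drain sits.
The per-node pass has no drain in it, and the explicit all-nodes pass
drains once up front.

```c
#include <assert.h>

static int drain_calls;		/* times the stand-in drain ran */
static int nodes_compacted;	/* per-node compaction passes */

/* Stand-in for lru_add_drain_all(), which queues work on every CPU. */
void model_drain_all(void)
{
	drain_calls++;
}

/* Per-node compaction with no drain in it, as kswapd would call it. */
void model_compact_pgdat(int nid)
{
	(void)nid;
	nodes_compacted++;
}

/* Explicit compaction of every node: drain once, then walk the nodes. */
void model_compact_nodes(int nr_nodes)
{
	model_drain_all();
	for (int nid = 0; nid < nr_nodes; nid++)
		model_compact_pgdat(nid);
}

/* One kswapd-style pass on node 0, then one explicit pass over all nodes. */
int model_demo(int nr_nodes)
{
	int before = drain_calls;

	model_compact_pgdat(0);
	assert(drain_calls == before);	/* kswapd path never drains */
	model_compact_nodes(nr_nodes);
	return drain_calls - before;	/* one drain, however many nodes */
}
```

With the drain out of the per-node function, kswapd never reaches
schedule_on_each_cpu() and so never takes pcpu_alloc_mutex on this path.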

> But so as not to get blocked on this, with lockdep disabled from giving
> further reports, here's a patch which removes the lru_add_drain_all()
> from kswapd's callpath (and calls it only once from compact_nodes(),
> instead of once per node).
> 
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>

Thanks

Acked-by: Mel Gorman <mel@xxxxxxxxx>

-- 
Mel Gorman
SUSE Labs
