[ cc Michal Hocko, just so he can see a lockdep reclaim state false
positive ]

On Wed, May 11, 2016 at 10:57:56PM -0700, Darrick J. Wong wrote:
> On Thu, May 12, 2016 at 01:53:44PM +0800, Qu Wenruo wrote:
> > Hi Darrick,
> >
> > Thanks for your reflink work for xfs first, that's quite a good and useful
> > feature, also helps to debug btrfs problems.
> > (Without that, there is no good reference for reflink behavior)
> >
> > But when testing your for-dave-for-4.6 branch, even I'm just testing btrfs
> > with xfstests, kernel report some strange lockdep from xfs:
.....
> > =================================
> > [ INFO: inconsistent lock state ]
> > 4.5.0-rc2+ #4 Tainted: G O
> > ---------------------------------
> > inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> > kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >  (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> > {RECLAIM_FS-ON-R} state was registered at:
> >   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
> >   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
> >   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
> >   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
> >   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
> >   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
> >   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
> >   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
> >   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
> >   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
> >   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
> >   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

Ignoring whether reflink should be doing anything here or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem is that lockdep
has seen this allocation from within a transaction, hence a GFP_NOFS
allocation, and now it's seeing it in a GFP_KERNEL context. Also note
that we have an active reference to this inode.

> > irq event stamp: 510775
> > hardirqs last enabled at (510775): [<ffffffff812245d0>] __slab_alloc+0x50/0x70
> > hardirqs last disabled at (510774): [<ffffffff812245ae>] __slab_alloc+0x2e/0x70
> > softirqs last enabled at (510506): [<ffffffff810c8ea8>] __do_softirq+0x358/0x430
> > softirqs last disabled at (510489): [<ffffffff810c911d>] irq_exit+0xad/0xb0
> >
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> >
> >        CPU0
> >        ----
> >   lock(&xfs_nondir_ilock_class);
> >   <Interrupt>
> >     lock(&xfs_nondir_ilock_class);

So, because the reclaim annotations overload the interrupt level
detections and lockdep has seen the inode ilock being taken in reclaim
("interrupt") context, this triggers a reclaim context warning where it
thinks it is unsafe to do this allocation in GFP_KERNEL context while
holding the inode ilock...
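
To make the two contexts concrete, the report boils down to something
like the sketch below. These are not the actual functions - the
example_*() names are made up, and the cursor allocation is shown
inline rather than via xfs_refcountbt_init_cursor() - but the locking
and allocation interfaces are the real ones:

/* Context 1: fiemap - we hold an active reference to the inode */
static void example_fiemap_path(struct xfs_inode *ip)
{
	struct xfs_btree_cur	*cur;

	xfs_ilock(ip, XFS_ILOCK_SHARED);

	/*
	 * KM_SLEEP outside a transaction is a GFP_KERNEL allocation
	 * made while holding the ilock, so lockdep records
	 * RECLAIM_FS-ON-R against the ilock class: "reclaim may
	 * recurse while this lock class is held".
	 */
	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
	kmem_zone_free(xfs_btree_cur_zone, cur);

	xfs_iunlock(ip, XFS_ILOCK_SHARED);
}

/* Context 2: kswapd evicting an inode with no active references */
static void example_reclaim_path(struct xfs_inode *ip)
{
	/*
	 * Taking the same lock class from reclaim context registers
	 * IN-RECLAIM_FS-W, which lockdep then flags as inconsistent
	 * with the state above - even though this inode cannot be
	 * the one the fiemap path holds a reference to.
	 */
	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
}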
> >
> >  *** DEADLOCK ***
> >
> > 3 locks held by kswapd0/543:
> >  #0: (shrinker_rwsem){++++..}, at: [<ffffffff811e0b78>] shrink_slab.part.63.constprop.79+0x48/0x4e0
> >  #1: (&type->s_umount_key#26){++++++}, at: [<ffffffff81232ffb>] trylock_super+0x1b/0x50
> >  #2: (sb_internal){.+.+.?}, at: [<ffffffff812327f4>] __sb_start_write+0xb4/0xf0
> >
> > stack backtrace:
> > CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4
> > Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> >  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
> >  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
> >  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
> > Call Trace:
> >  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
> >  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
> >  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
> >  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
> >  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
> >  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
> >  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
> >  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
> >  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
> >  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
> >  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
> >  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
> >  [<ffffffff8124d7d5>] evict+0xc5/0x190
> >  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
> >  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
> >  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
> >  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
> >  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
> >  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
> >  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
> >  [<ffffffff810e6882>] kthread+0xf2/0x110
> >  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> >  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
> >  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> > hrtimer: interrupt took 4824925 ns

The reclaim -> lock context that it's complaining about here is on an
inode being reclaimed - it has no active references and so, by
definition, cannot deadlock with a context holding an active reference
to an inode ilock. Hence there cannot possibly be a deadlock here, but
we can't easily tell lockdep that without going back to the bad old way
of creating a new lockdep class for inode ilocks the moment they enter
->evict. That disables "entire lifecycle" lockdep checking on the xfs
inode ilock, which is why we got rid of it in the first place.

Our "traditional" solution to this is to add KM_NOFS to the allocation
to force GFP_NOFS allocation in all contexts, so that lockdep shuts up
and people don't keep endlessly reporting the same non-problem. A rough
sketch of what I mean is below my sig.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
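
The sketch: the allocation in question is the btree cursor allocation
done by xfs_refcountbt_init_cursor() (the kmem_zone_alloc call in the
trace above). I'm going from memory of how the other btree cursor
constructors look, so the exact allocation call and zone name may not
match the reflink branch - the point is only the KM_NOFS flag:

	/* in xfs_refcountbt_init_cursor() */
	struct xfs_btree_cur	*cur;

	/*
	 * Force GFP_NOFS whether or not we're running inside a
	 * transaction, so this allocation site always has the same
	 * reclaim context and lockdep stops seeing inconsistent
	 * states on the inode ilock class.
	 */
	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);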