On Thu 06-10-16 15:04:54, Michal Hocko wrote: > [Let me ressurect this thread] ping > On Wed 01-06-16 20:16:17, Peter Zijlstra wrote: > > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote: > > > Thanks Dave for your detailed explanation again! Peter do you have any > > > other idea how to deal with these situations other than opt out from > > > lockdep reclaim machinery? > > > > > > If not I would rather go with an annotation than a gfp flag to be honest > > > but if you absolutely hate that approach then I will try to check wheter > > > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would > > > steal the description from Dave's email and repost my patch. > > > > > > I plan to repost my scope gfp patches in few days and it would be good > > > to have some mechanism to drop those GFP_NOFS to paper over lockdep > > > false positives for that. > > > > Right; sorry I got side-tracked in other things again. > > > > So my favourite is the dedicated GFP flag, but if that's unpalatable for > > the mm folks then something like the below might work. It should be > > similar in effect to your proposal, except its more limited in scope. > > OK, so the situation with the GFP flags is somehow relieved after > http://lkml.kernel.org/r/20160912114852.GI14524@xxxxxxxxxxxxxx and with > the root radix tree remaining the last user which mangles gfp_mask and > tags together we have some few bits left there. As you apparently hate > any scoped API and Dave thinks that per allocation flag is the only > maintainable way for xfs what do you think about the following? > --- > From 04b3923e5b12f0eb3859f0718881fa0f40e60164 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@xxxxxxxx> > Date: Fri, 13 May 2016 17:47:31 +0200 > Subject: [PATCH] lockdep: allow to disable reclaim lockup detection > > The current implementation of the reclaim lockup detection can lead to > false positives and those even happen and usually lead to tweak the > code to silence the lockdep by using GFP_NOFS even though the context > can use __GFP_FS just fine. See > http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example. > > ================================= > [ INFO: inconsistent lock state ] > 4.5.0-rc2+ #4 Tainted: G O > --------------------------------- > inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage. > kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes: > > (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs] > > {RECLAIM_FS-ON-R} state was registered at: > [<ffffffff8110f369>] mark_held_locks+0x79/0xa0 > [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100 > [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230 > [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs] > [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs] > [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs] > [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs] > [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs] > [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs] > [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670 > [<ffffffff81244459>] SyS_ioctl+0x79/0x90 > [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f > > CPU0 > ---- > lock(&xfs_nondir_ilock_class); > <Interrupt> > lock(&xfs_nondir_ilock_class); > > *** DEADLOCK *** > > 3 locks held by kswapd0/543: > > stack backtrace: > CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4 > > Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 > > ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0 > ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001 > ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0 > Call Trace: > [<ffffffff813a14f9>] dump_stack+0x4b/0x72 > [<ffffffff8110ec65>] print_usage_bug+0x215/0x240 > [<ffffffff8110ee85>] mark_lock+0x1f5/0x660 > [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0 > [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50 > [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230 > [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs] > [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0 > [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs] > [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] > [<ffffffff8110aace>] down_write_nested+0x5e/0xc0 > [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs] > [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs] > [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] > [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs] > [<ffffffff8124d7d5>] evict+0xc5/0x190 > [<ffffffff8124d8d9>] dispose_list+0x39/0x60 > [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60 > [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0 > [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0 > [<ffffffff811e50ee>] shrink_zone+0x15e/0x170 > [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80 > [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230 > [<ffffffff810e6882>] kthread+0xf2/0x110 > [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220 > [<ffffffff8184803f>] ret_from_fork+0x3f/0x70 > [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220 > > To quote Dave: > " > Ignoring whether reflink should be doing anything or not, that's a > "xfs_refcountbt_init_cursor() gets called both outside and inside > transactions" lockdep false positive case. The problem here is > lockdep has seen this allocation from within a transaction, hence a > GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context. > Also note that we have an active reference to this inode. > > So, because the reclaim annotations overload the interrupt level > detections and it's seen the inode ilock been taken in reclaim > ("interrupt") context, this triggers a reclaim context warning where > it thinks it is unsafe to do this allocation in GFP_KERNEL context > holding the inode ilock... > " > > This sounds like a fundamental problem of the reclaim lock detection. > It is really impossible to annotate such a special usecase IMHO unless > the reclaim lockup detection is reworked completely. Until then it > is much better to provide a way to add "I know what I am doing flag" > and mark problematic places. This would prevent from abusing GFP_NOFS > flag which has a runtime effect even on configurations which have > lockdep disabled. > > Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to > skip the current allocation request. > > While we are at it also make sure that the radix tree doesn't > accidentaly override tags stored in the upper part of the gfp_mask. > > Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > Signed-off-by: Michal Hocko <mhocko@xxxxxxxx> > --- > include/linux/gfp.h | 10 +++++++++- > kernel/locking/lockdep.c | 4 ++++ > lib/radix-tree.c | 1 + > 3 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 3e6c48dbe6b9..cee3d5fa3821 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -41,6 +41,11 @@ struct vm_area_struct; > #define ___GFP_OTHER_NODE 0x800000u > #define ___GFP_WRITE 0x1000000u > #define ___GFP_KSWAPD_RECLAIM 0x2000000u > +#ifdef CONFIG_LOCKDEP > +#define ___GFP_NOLOCKDEP 0x4000000u > +#else > +#define ___GFP_NOLOCKDEP 0 > +#endif > /* If the above are modified, __GFP_BITS_SHIFT may need updating */ > > /* > @@ -186,8 +191,11 @@ struct vm_area_struct; > #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) > > +/* Disable lockdep for GFP context tracking */ > +#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) > + > /* Room for N __GFP_FOO bits */ > -#define __GFP_BITS_SHIFT 26 > +#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP)) > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) > > /* > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index d96c6e058467..a652ac8b3cfa 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -2871,6 +2871,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags) > if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))) > return; > > + /* Disable lockdep if explicitly requested */ > + if (gfp_mask & __GFP_NOLOCKDEP) > + return; > + > mark_held_locks(curr, RECLAIM_FS); > } > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c > index 1b7bf7314141..3154403d30e8 100644 > --- a/lib/radix-tree.c > +++ b/lib/radix-tree.c > @@ -1672,6 +1672,7 @@ static int radix_tree_callback(struct notifier_block *nfb, > > void __init radix_tree_init(void) > { > + BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32); > radix_tree_node_cachep = kmem_cache_create("radix_tree_node", > sizeof(struct radix_tree_node), 0, > SLAB_PANIC | SLAB_RECLAIM_ACCOUNT, > -- > 2.9.3 > > -- > Michal Hocko > SUSE Labs -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>