Thanks Dave for your detailed explanation again! Peter do you have any other idea how to deal with these situations other than opt out from lockdep reclaim machinery? If not I would rather go with an annotation than a gfp flag to be honest but if you absolutely hate that approach then I will try to check wheter a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would steal the description from Dave's email and repost my patch. I plan to repost my scope gfp patches in few days and it would be good to have some mechanism to drop those GFP_NOFS to paper over lockdep false positives for that. [keeping Dave's explanation for reference] On Fri 20-05-16 10:17:14, Dave Chinner wrote: > On Thu, May 19, 2016 at 10:11:46AM +0200, Peter Zijlstra wrote: > > On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote: [...] > > > There's a maze of dark, grue-filled twisty passages here... > > > > OK; I might need a bit more again. > > > > So now the code does something like: > > > > down_read(&i_lock); -- lockdep marks lock as held > > kmalloc(GFP_KERNEL); -- lockdep marks held locks as ENABLED_RECLAIM_FS > > --> reclaim() > > down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS > > > > Right? > > In the path that can deadlock the log, yes. It's actually way more > complex than the above, because the down_read_trylock(&i_lock) that > matters is run in a completely separate, async kthread that > xfs_trans_reserve() will block waiting for. > > process context xfsaild kthread(*) > --------------- ------------------ > down_read(&i_lock); -- lockdep marks lock as held > kmalloc(GFP_KERNEL); -- lockdep marks held locks as ENABLED_RECLAIM_FS > --> reclaim() > xfs_trans_reserve() > .... > xfs_trans_push_ail() ---- called if no space in the log to kick the xfsaild into action > .... > xlog_grant_head_wait() ---- blocks waiting for log space > ..... > > xfsaild_push() ----- iterates AIL > grabs log item > lock log item > >>>>>>>>>>>>>>>>>>>>> down_read_trylock(&i_lock); > format item into buffer > add to dirty buffer list > .... > submit dirty buffer list for IO > buffer IO started > ..... > <async IO completion context> > buffer callbacks > mark inode clean > remove inode from AIL > move tail of log forwards > wake grant head waiters > <woken by log tail moving> > <log space available> > transaction reservation granted > ..... > down_write(some other inode ilock) > <modify some other inode> > xfs_trans_commit > ..... > > (*) xfsaild runs with PF_MEMALLOC context. > > The problem is that if the ilock is held exclusively at GFP_KERNEL > time, the xfsaild cannot lock the inode to flush it, so if that > inode pins the tail of the log then we can't make space available > for xfs_trans_reserve and there is the deadlock. > > Once xfs_trans_reserve completes, however, we'll take the ilock on > *some other inode*, and that's where the "it can't be the inode we > currently hold locked because we have references to it" and > henceit's safe to have a pattern like: > > down_read(&i_lock); -- lockdep marks lock as held > kmalloc(GFP_KERNEL); -- lockdep marks held locks as ENABLED_RECLAIM_FS > --> reclaim() > down_write(&ilock) > > because the lock within reclaim context is completely unrelated to > the lock we already hold. > > Lockdep can't possibly know about this because the deadlock involves > locking contexts that *aren't doing anything wrong within their own > contexts*. It's only when you add the dependency of log space > reservation requirements needed to make forwards progress that > there's then an issue with locking and reclaim. > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>