On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote: > > Thanks for writing all that down Dave! > > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote: > > > The reason we don't have lock clases for the ilock is that we aren't > > supposed to call memory reclaim with that lock held in exclusive > > mode. This is because reclaim can run transactions, and that may > > need to flush dirty inodes to make progress. Flushing dirty inode > > requires taking the ilock in shared mode. > > > > In the code path that was reported, we hold the ilock in /shared/ > > mode with no transaction context (we are doing a read-only > > operation). This means we can run transactions in memory reclaim > > because a) we can't deadlock on the inode we hold locks on, and b) > > transaction reservations will be able to make progress as we don't > > hold any locks it can block on. > > Just to clarify; I read the above as that we cannot block on recursive > shared locks, is this correct? > > Because we can in fact block on down_read()+down_read() just fine, so if > you're assuming that, then something's busted. The transaction reservation path will run down_read_trylock() on the inode, not down_read(). Hence if there are no pending writers, it will happily take the lock twice and make progress, otherwise it will skip the inode and there's no deadlock. If there's a pending writer, then we have another context that is already in a transaction context and has already pushed the item, hence it is only in the scope of the current push because IO hasn't completed yet and removed it from the list. > Otherwise, I'm not quite reading it right, which is, given the > complexity of that stuff, entirely possible. There's a maze of dark, grue-filled twisty passages here... > The other possible reading is that we cannot deadlock on the inode we > hold locks on because we hold a reference on it; and the reference > avoids the inode from being reclaimed. But then the whole > shared/exclusive thing doesn't seem to make sense. Right, because that's not the problem. The issue has to do with transaction contexts and what locks are safe to hold when calling xfs_trans_reserve(). Direct reclaim is putting xfs_trans_reserve() behind memory allocation, which means it is unsafe for XFS to hold the ilock exclusive or be in an existing transaction context when doing GFP_KERNEL allocation. > > For the ilock, the number of places where the ilock is held over > > GFP_KERNEL allocations is pretty small. Hence we've simply added > > GFP_NOFS to those allocations to - effectively - annotate those > > allocations as "lockdep causes problems here". There are probably > > 30-35 allocations in XFS that explicitly use KM_NOFS - some of these > > are masking lockdep false positive reports. > > > > In the end, like pretty much all the complex lockdep false positives > > we've had to deal in XFS, we've ended up changing the locking or > > allocation contexts because that's been far easier than trying to > > make annotations cover everything or convince other people that > > lockdep annotations are insufficient. > > Well, I don't mind creating lockdep annotations; but explanations of the > exact details always go a long way towards helping me come up with > something. > > While going over the code; I see there's complaining about > MAX_SUBCLASSES being too small. Would it help if we doubled it? We > cannot grow the thing without limits, but doubling it should be possible > I think. Last time I asked cwif we could increase MAX_SUBCLASSES I was told no. So we've just had to try to fit about 30 different inode lock contexts into 8 subclasses split across multiple class types (i.e. xfs_[non]dir_ilock_class). I wasted an entire week on getting those annotations to fit the limitations of lockdep and still work. > In any case; would something like this work for you? Its entirely > untested, but the idea is to mark an entire class to skip reclaim > validation, instead of marking individual sites. Probably would, but it seems like swatting a fly with runaway train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag) so that we don't turn off something that will tell us we've made a mistake while developing new code... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>