On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> We've had reports on distro (pre-deferred inactivation) kernels
> that inode reclaim (i.e. via drop_caches) can deadlock on the
> s_umount lock when invoked on a frozen XFS fs. This occurs because
> drop_caches acquires the lock and then blocks in xfs_inactive() on
> transaction alloc for an inode that requires an eofb trim. unfreeze
> then blocks on the same lock and the fs is deadlocked.

Yup, but why do we need to address that in upstream kernels?

> With deferred inactivation, the deadlock problem is no longer
> present because ->destroy_inode() no longer blocks whether the fs
> is frozen or not. There is still unfortunate behavior in that
> lookups of a pending inactive inode spin loop waiting for the
> pending inactive state to clear, which won't happen until the fs
> is unfrozen.

Largely we took option 1 from the previous discussion:

| ISTM the currently most viable options we've discussed are:
|
| 1. Leave things as is, accept potential for lookup stalls while frozen
|    and wait and see if this ever really becomes a problem for real
|    users.

https://lore.kernel.org/linux-xfs/YeVxCXE6hXa1S%2Fic@bfoster/

And really it hasn't caused any serious problems with the upstream
and distro kernels that have background inodegc.

Regardless, the spinning lookup problem seems pretty easy to avoid.
We can turn it into a blocking lookup if the filesystem is frozen
simply by cycling sb_start_write/sb_end_write if there's a pending
inactivation on that inode, and now the lookup blocks until the
filesystem is thawed.

Alternatively, we could actually implement reinstantiation of the
VFS inode portion of the inode and reactivate the inode if inodegc
is pending. As Darrick mentioned, we didn't do this because of the
difficulty of removing arbitrary items from the middle of llist
structures.

However, it occurs to me that we don't even need to remove the
inode from the inodegc list to recycle it. We can be lazy - we just
need to set a flag on the inode to cancel the inodegc, and have
inodegc clear the flag and skip over cancelled inodes instead of
inactivating them. Then if a gc-cancelled inode gets reclaimed by
the VFS again, the inodegc queueing code can simply clear the
cancelled flag, as the inode is already queued for processing....

I think this avoids all the problems with needing to do inodegc
cleanup while the filesystem is frozen, so I'm leaning towards this
as the best way to solve this problem in the upstream kernel.

> This was always possible to some degree, but is potentially
> amplified by the fact that reclaim no longer blocks on the first
> inode that requires inactivation work. Instead, we populate the
> inactivation queues indefinitely. The side effect can be observed
> easily by invoking drop_caches on a frozen fs previously populated
> with eofb and/or cowblocks inodes and then running anything that
> relies on inode lookup (i.e., ls).

As we discussed last time, that is largely avoidable by only
queuing inodes that absolutely need cleanup work. i.e. call
xfs_can_free_eofblocks() and make it return false if the filesystem
has an elevated freeze state, because eof block clearing at reclaim
is not essential for correct operation - the inode simply behaves
as if it has XFS_DIFLAG_PREALLOC set on it. This will happen with
any inode that has had fallocate() called on it, so it's not like
it's a rare occurrence, either.
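To sketch what that might look like (untested, and the hypothetical
xfs_eofblocks_frozen() helper plus the s_writers.frozen check are
just one way of expressing the idea, not how upstream would
necessarily wire it up):

	/*
	 * Hypothetical helper: report whether the filesystem has an
	 * elevated freeze state so the eofblocks check can decline
	 * to do reclaim-time cleanup while frozen.
	 */
	static bool
	xfs_eofblocks_frozen(
		struct xfs_inode	*ip)
	{
		/* SB_UNFROZEN == 0, anything higher means a freeze is active */
		return ip->i_mount->m_super->s_writers.frozen > SB_UNFROZEN;
	}

	/* ... then, early in xfs_can_free_eofblocks(): */
		if (xfs_eofblocks_frozen(ip))
			return false;	/* behave as if XFS_DIFLAG_PREALLOC is set */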
The cowblocks case is a much rarer situation, so that case could
potentially just queue those inodes until the freeze goes away. But
if we can reinstantiate inodegc-queued inodes as per my suggestion
above, then we can just leave the inodegc queuing unchanged and not
care how long they block for, because if they are needed again
whilst queued we can reuse them immediately....

> Finally, since the deadlock issue was present for such a long
> time, also document the subtle ->destroy_inode() constraint to
> avoid unintentional reintroduction of the deadlock problem in the
> future.

That's still useful, though.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx