On Fri, Jan 19, 2024 at 02:36:45PM -0500, Brian Foster wrote:
> We've had reports on distro (pre-deferred inactivation) kernels
> that inode reclaim (i.e. via drop_caches) can deadlock on the
> s_umount lock when invoked on a frozen XFS fs. This occurs because
> drop_caches acquires the lock and then blocks in xfs_inactive() on
> transaction alloc for an inode that requires an eofb trim. unfreeze
> then blocks on the same lock and the fs is deadlocked.

Yup, but why do we need to address that in upstream kernels?

> With deferred inactivation, the deadlock problem is no longer
> present because ->destroy_inode() no longer blocks whether the fs
> is frozen or not. There is still unfortunate behavior in that
> lookups of a pending inactive inode spin loop waiting for the
> pending inactive state to clear, which won't happen until the fs
> is unfrozen.

Largely we took option 1 from the previous discussion:

| ISTM the currently most viable options we've discussed are:
|
| 1. Leave things as is, accept potential for lookup stalls while frozen
|    and wait and see if this ever really becomes a problem for real
|    users.

https://lore.kernel.org/linux-xfs/YeVxCXE6hXa1S%2Fic@bfoster/

And really it hasn't caused any serious problems with the upstream
and distro kernels that have background inodegc.

Regardless, the spinning lookup problem seems pretty easy to avoid.
We can turn it into a blocking lookup if the filesystem is frozen
simply by cycling sb_start_write/sb_end_write if there's a pending
inactivation on that inode, and now the lookup blocks until the
filesystem is thawed.

Alternatively, we could actually implement reinstantiation of the
VFS inode portion of the inode and reactivate the inode if inodegc
is pending. As Darrick mentioned, we didn't do this because of the
difficulty of removing arbitrary items from the middle of llist
structures.

However, it occurs to me that we don't even need to remove the
inode from the inodegc list to recycle it. We can be lazy - we just
need to set a flag on the inode to cancel the inodegc, and have
inodegc clear the flag and skip over cancelled inodes instead of
inactivating them. Then if a gc-cancelled inode gets reclaimed by
the VFS again, the inodegc queueing code can simply clear the
cancelled flag, as the inode is already queued for processing....

I think this avoids all the problems with needing to do inodegc
cleanup while the filesystem is frozen, so I'm leaning towards this
as the best way to solve this problem in the upstream kernel.

> This was always possible to some degree, but is potentially
> amplified by the fact that reclaim no longer blocks on the first
> inode that requires inactivation work. Instead, we populate the
> inactivation queues indefinitely. The side effect can be observed
> easily by invoking drop_caches on a frozen fs previously populated
> with eofb and/or cowblocks inodes and then running anything that
> relies on inode lookup (i.e., ls).

As we discussed last time, that is largely avoidable by only
queuing inodes that absolutely need cleanup work. i.e. call
xfs_can_free_eofblocks() and make it return false if the filesystem
has an elevated freeze state, because eof block clearing at reclaim
is not essential for correct operation - the inode simply behaves
as if it has XFS_DIFLAG_PREALLOC set on it. This will happen with
any inode that has had fallocate() called on it, so it's not like
it's a rare occurrence, either.
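To sketch what that might look like (untested, and the hypothetical
xfs_eofblocks_frozen() helper plus the s_writers.frozen check are
just one way of expressing the idea, not how upstream would
necessarily wire it up):

	/*
	 * Hypothetical helper: report whether the filesystem has an
	 * elevated freeze state so the eofblocks check can decline
	 * to do reclaim-time cleanup while frozen.
	 */
	static bool
	xfs_eofblocks_frozen(
		struct xfs_inode	*ip)
	{
		/* SB_UNFROZEN == 0, anything higher means a freeze is active */
		return ip->i_mount->m_super->s_writers.frozen > SB_UNFROZEN;
	}

	/* ... then, early in xfs_can_free_eofblocks(): */
		if (xfs_eofblocks_frozen(ip))
			return false;	/* behave as if XFS_DIFLAG_PREALLOC is set */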
The cowblocks case is a much rarer situation, so that case could
potentially just queue those inodes until the freeze goes away. But
if we can reinstantiate inodegc-queued inodes as per my suggestion
above, then we can just leave the inodegc queuing unchanged and not
care how long they block for, because if they are needed again
whilst queued we can reuse them immediately....

> Finally, since the deadlock issue was present for such a long
> time, also document the subtle ->destroy_inode() constraint to
> avoid unintentional reintroduction of the deadlock problem in the
> future.

That's still useful, though.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx