Re: [PATCH] xfs: reset child dir '..' entry when unlinking child

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 7 Jul 2021 09:07:36 +1000

On Tue, Jul 06, 2021 at 02:49:29PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 06, 2021 at 07:20:21AM +0800, Gao Xiang wrote:
> > Hi Dave and Christoph,
> > 
> > On Tue, Jul 06, 2021 at 08:09:25AM +1000, Dave Chinner wrote:
> > > On Fri, Jul 02, 2021 at 08:02:33PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > 
> > > > While running xfs/168, I noticed a second source of post-shrink
> > > > corruption errors causing shutdowns.
> > > > 
> > > > Let's say that directory B has a low inode number and is a child of
> > > > directory A, which has a high number.  If B is empty but open, and
> > > > unlinked from A, B's dotdot link continues to point to A.  If A is then
> > > > unlinked and the filesystem shrunk so that A is no longer a valid inode,
> > > > a subsequent AIL push of B will trip the inode verifiers because the
> > > > dotdot entry points outside of the filesystem.
> > > 
> > > So we have a directory inode that is empty and unlinked but held
> > > open, with a back pointer to an invalid inode number? Which can
> > > never be followed, because the directory has been unlinked.
> > > 
> > > Can't this be handled in the inode verifier? This seems to me to
> > > be a pretty clear cut case where the ".." back pointer should
> > > always be considered invalid (because the parent dir has no
> > > existence guarantee once the child has been removed from it), not
> > > just in the situation where the filesystem has been shrunk...
> > 
> > Yes, I agree all of this, this field can be handled by the inode
> > verifier. The only concern I can think out might be fs freeze
> > with a directory inode that is empty and unlinked but held open,
> > and then recovery on a unpatched old kernels, not sure if such
> > case can be handled properly by old kernel verifier.
> 
> I decided against changing the shortform directory verifier to ignore
> the dotdot pointer on nlink==0 directories because older kernels will
> still have the current behavior and that will cause recovery failure for
> the following sequence:
> 
> 1. create A and B and delete A as per above, leave B open
> 2. shrink fs so that A is no longer a valid inode number
> 3. flush the shrink transaction to disk
> 4. futimens(B) (or anything to dirty the inode)
> 5. crash
> 6. boot old kernel
> 7. try to recover fs with old kernel, which will crash since B hadn't
>    been inactivated

Yup, I can see how that would happen, but patching the directory
code still smells like band-aid to me on multiple levels.

At first glance, this seems like the first step in a on-going game
of whack-a-mole. If shrink is changing how the on disk format needs
to be evaluated in ways that older kernels can't safely deal with,
then shrink needs to be setting an incompat feature bit to prevent
older kernels from crashing trying to parse the on-disk format.

On deeper consideration after looking at the code, why is it even
considered safe to be recovering pre- and post- shrink operations in
a single log recovery operation? i.e. I see nothing in the shrink
code that quiesces the log and flushes all the dirty metadata to
stable storage prior to logging the superblock geometry changes and
returning to userspace.

I can think of another situation where this might be problematic -
recovering buffers containing unlinked inode lists updates. If the
inode cluster buffers we recover from transactions before the shrink
point to inodes that are beyond EOFS after the shrink, we recover
them and write them to disk, only for them to be read again and
written to disk when recovering subsequent transactions prior
to shrink.

IOWs, we can have transient on-disk state during log recovery where
stuff before the shrink transaction points to objects beyond the
EOFS after the shrink, but are still considered to be valid by log
recovery. If the device has already been shrunk, then log recovery
has just created a transient corrupt on-disk state, so if log
recovery fails at this point (for whatever reason), we've got a mess
on our hands.

I suspect that this means we could also have problems in the AIL,
too, but I haven't thought that far through this right now. The
transient corrupt log recovery state can be fixed by forcing the AIL
to be written back before we execute the shrink transaction so the
shrink transaction ends up being written to the tail of the log.
Doing this would also fix any sort of transient AIL ordering issue
that might exist around a shrink operation....

So, yeah, I think this problem hints at bigger issues with shrink,
on disk format interpretation and transient log recovery states. I
think we need to finish off all the infrastructure shrink needs
before we go any further with it....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx