On Tue, Jul 06, 2021 at 02:49:29PM -0700, Darrick J. Wong wrote: > On Tue, Jul 06, 2021 at 07:20:21AM +0800, Gao Xiang wrote: > > Hi Dave and Christoph, > > > > On Tue, Jul 06, 2021 at 08:09:25AM +1000, Dave Chinner wrote: > > > On Fri, Jul 02, 2021 at 08:02:33PM -0700, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > > > While running xfs/168, I noticed a second source of post-shrink > > > > corruption errors causing shutdowns. > > > > > > > > Let's say that directory B has a low inode number and is a child of > > > > directory A, which has a high number. If B is empty but open, and > > > > unlinked from A, B's dotdot link continues to point to A. If A is then > > > > unlinked and the filesystem shrunk so that A is no longer a valid inode, > > > > a subsequent AIL push of B will trip the inode verifiers because the > > > > dotdot entry points outside of the filesystem. > > > > > > So we have a directory inode that is empty and unlinked but held > > > open, with a back pointer to an invalid inode number? Which can > > > never be followed, because the directory has been unlinked. > > > > > > Can't this be handled in the inode verifier? This seems to me to > > > be a pretty clear cut case where the ".." back pointer should > > > always be considered invalid (because the parent dir has no > > > existence guarantee once the child has been removed from it), not > > > just in the situation where the filesystem has been shrunk... > > > > Yes, I agree all of this, this field can be handled by the inode > > verifier. The only concern I can think out might be fs freeze > > with a directory inode that is empty and unlinked but held open, > > and then recovery on a unpatched old kernels, not sure if such > > case can be handled properly by old kernel verifier. > > I decided against changing the shortform directory verifier to ignore > the dotdot pointer on nlink==0 directories because older kernels will > still have the current behavior and that will cause recovery failure for > the following sequence: > > 1. create A and B and delete A as per above, leave B open > 2. shrink fs so that A is no longer a valid inode number > 3. flush the shrink transaction to disk > 4. futimens(B) (or anything to dirty the inode) > 5. crash > 6. boot old kernel > 7. try to recover fs with old kernel, which will crash since B hadn't > been inactivated Yup, I can see how that would happen, but patching the directory code still smells like band-aid to me on multiple levels. At first glance, this seems like the first step in a on-going game of whack-a-mole. If shrink is changing how the on disk format needs to be evaluated in ways that older kernels can't safely deal with, then shrink needs to be setting an incompat feature bit to prevent older kernels from crashing trying to parse the on-disk format. On deeper consideration after looking at the code, why is it even considered safe to be recovering pre- and post- shrink operations in a single log recovery operation? i.e. I see nothing in the shrink code that quiesces the log and flushes all the dirty metadata to stable storage prior to logging the superblock geometry changes and returning to userspace. I can think of another situation where this might be problematic - recovering buffers containing unlinked inode lists updates. If the inode cluster buffers we recover from transactions before the shrink point to inodes that are beyond EOFS after the shrink, we recover them and write them to disk, only for them to be read again and written to disk when recovering subsequent transactions prior to shrink. IOWs, we can have transient on-disk state during log recovery where stuff before the shrink transaction points to objects beyond the EOFS after the shrink, but are still considered to be valid by log recovery. If the device has already been shrunk, then log recovery has just created a transient corrupt on-disk state, so if log recovery fails at this point (for whatever reason), we've got a mess on our hands. I suspect that this means we could also have problems in the AIL, too, but I haven't thought that far through this right now. The transient corrupt log recovery state can be fixed by forcing the AIL to be written back before we execute the shrink transaction so the shrink transaction ends up being written to the tail of the log. Doing this would also fix any sort of transient AIL ordering issue that might exist around a shrink operation.... So, yeah, I think this problem hints at bigger issues with shrink, on disk format interpretation and transient log recovery states. I think we need to finish off all the infrastructure shrink needs before we go any further with it.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx