Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 15 Jun 2013 10:56:39 +1000

On Fri, Jun 14, 2013 at 02:44:53PM -0500, Ben Myers wrote:
> Unfortunately log replay is broken.  The verifier has detected this and stopped
> replay.  Ideally the solution would be to fix log replay, but that is going to
> take some time.  So, in the near term we're just going to disable the verifier
> to allow replay to complete.
> 
> I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so
> that developers still have a chance at hitting the log replay problem, and a
> comment should be added explaining that we've disabled the verifier due to a
> specific bug as a temporary workaround and we'll re-enable the verifier once
> it's fixed.  I'll update the patch and repost.
> 
> Are you guys arguing that the log replay bug should not be fixed?

I'm not arguing that it should not be fixed, I'm *stating* that it
*can't be fixed* for non-CRC filesystems. That is, the best we can
do is work around this deficiency in log recovery with some kind of
self-defusing warning....

To fix it properly, you need to know the age of the object being
overwritten relative to the age of overwrite data. For non-CRC
filesystems we don't have that information in the metadata object
being overwritten. We can't even correctly identify the object being
overwritten.

So, it's simply not fixable in log recovery for non-CRC filesystems,
and the LSN stamped in every piece of metadata at writeback time for
CRC enabled filesystems is designed precisely to avoid this problem.

Indeed, the LSN stamp is a far more effective method than what used
to be in the inode core to try to avoid the problem for unlogged
inode size updates to try to prevent log recovery from replaying
inode core updates over more recently written inodes - the
di_flushiter field.  Note the comment in xfs_flush_int():

        /*
         * bump the flush iteration count, used to detect flushes which
         * postdate a log record during recovery. This is redundant as we now
         * log every change and hence this can't happen. Still, it doesn't hurt.
         */
        ip->i_d.di_flushiter++;

And this in xlog_recover_inode_pass2():

        /* Skip replay when the on disk inode is newer than the log one */
        if (dicp->di_flushiter < be16_to_cpu(dip->di_flushiter)) {

This recovery problem has been around forever, and it we cannot fix
log recovery with age information in the metadata on disk that
recovery is going overwrite. CRC enabled filesystems have that
information on disk, existing filesystems don't. Therefore, we can
only solve the recovery problem for CRC enabled filesystems...

We could probably also avoid the problem by modifying the way we do
writeback from the AIL to limit it only to objects at the tail LSN,
but that has a horrific performance penalty associated with it for
many common workloads because of the way relogging works. And for a
problem that I've suspected has occurred maybe 5 times in 10 years,
modifying metadata writeback to avoid this problem is a bad tradeoff
to make for just about everyone...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs