[PATCH 00/22 V4] xfs: metadata verifiers

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 6 Nov 2012 16:13:11 +1100

Hi folks,

Fourth version of the buffer verifier series. The read verifier
infrastructure is described here:

http://oss.sgi.com/archives/xfs/2012-10/msg00146.html

The second version with write verifiers is described here:

http://oss.sgi.com/archives/xfs/2012-10/msg00280.html

This version add write verifiers to all buffers that aren't directly
read (i.e. via xfs_buf_get*() interfaces), and drops the log
recovery verifiers from the series as it really needs more buffer
item format flags to do relaibly.

The seris is just about ready to go - it passes all of xfstests here
except for 070. With the addition of the getbuf write verifiers,
this series is now detecting a corrupt xfs_da_node buffer being
written to disk.  It appears to be a new symptom of known problem,
as tracing indicates that the test is triggering the same double
split/join pattern as described here:

http://oss.sgi.com/archives/xfs/2012-03/msg00347.html

There are signs that test 117 might also trigger the situation - it
is hitting the double leaf split code as well, but not following it
up with a join so it hasn't triggered the bad write (yet!).

This is a pre-existing corruption, that the above post took 3 solid
days of debugging to understand enough to write that description.
I'm kind of glad I did right it now, because it was a great
refresher.  However, it might take me another 3 days to did the
problem out, and I'd really like to get this code into the dev tree
ASAP.

Cheers,

Dave.

--

Changes in version 4:
- fixed function names in agf/agi verifier corruption output
- fixed xfs_da_node verify function (use && instead of ||)
- dropped log recovery verifier patch as the recovery of remote
  symlink buffers cannot be distinguished from other metadata magic
  numbers. Will be re-introduced when the CRC format change is made
- added write verifiers for newly allocated buffers.
- all the bug fixes have been moved to a separate series that needs
  to be applied first (some are already in the dev tree).

Changes in version 3:
- update agfl verfier commit to mention debug checks are being done
  unconditionally now.
- fixed agfl verifier null point crash when invalid block numbers
  are found
- ifdef'd out agfl verifier checks as they are not reliable because
  mkfs does not initialise the full AGFL to known values.
- fixed quiet mount flag handling for superblock verification.
- directorry -> directory
- convert to struct buf_ops method of attaching verifiers to the
  buffer. This provides a much cleaner abstraction and simpler
  future expansion of operations on the buffer. It removes a great
  deal of code that is repeated through all the verifiers, too, by
  separating them from buffer IO completion processing.
- add initial support for log write verifiers

  Log write verifiers are, in general, identical to the existing
  verifiers. There are only a small number of modifications
  necessary, mainly due to log recovery occurring before certain
  in-memory structures are initialised (e.g. the struct xfs_perag).
  Write verifiers that need different checks during recovery do so
  via detection of the XLOG_ACTIVE_RECOVERY flag on the log.

  Log recovery does not do read verification of the buffers at this
  point in time, mainly due to the fact we don't know what the
  contents of the buffer is before we read it - the buffer logging
  is generic and content unaware. However, almost all metadata has
  magic numbers in it, so after the changes have been replayed into
  the buffer we can snoop the magic number out of the buffer and
  attach the appropriate verifier before it is written back. Hence
  we should catch gross corruptions introduced by recovery errors.

Changes in Version 2:

- fixed use of xfs_dir2_db_t instead of xfs_dablk_t in directory and
  attr read functions (found when testing xfstests --large-fs on a
  500TB fs and attribute block numbers went beyond 32 bits). This
  mistake was copy-n-pasted several times.
- fixed use of "int map_type" instead of "xfs_daddr_t mappedbno" in
  directory and attr read functions.
- fixed incorrect logic in xfs_dir2_block_verify where a failed
  block check would not clear the block_ok flag correctly
- invalidate allocbt->freelist buffers so they don't get written
  after being freed and while still on the freelist
- added initial suppor for write verifiers.

  Write verifiers are similar to read verifiers, the are simply
  called just prior to issuing the IO on the buffer. The buffer is
  locked at this point, so we are guaranteed an unchanging buffer
  to work from.

  The initial write verifiers are simply the same as the read
  verifiers, except they don't have the ioend processing in them. A
  failure of the write verifier will cause the filesystem to shut
  down as writing invalid metadata to disk is a bad thing. The write
  verifier for the alloc btree blocks was what discovered the
  writing of freed allocbt blocks to disk from the free list.

  Eventually, the metadata CRC will be calculated in the write
  verifier after validating that the buffer contents are valid.

This is the next step along the road to metadata CRC checking. What
the series does is add an iodone callback to most metadata buffer
read operations that is only executed when the buffer is physically
read from disk.  Read operations that hit the cache do no trigger a
verification, as CRCs only protect the on-disk metadata and the
in-memory buffer can be changed at any time after it is read without
recalculating the CRC of the buffer.

Hence we need infrastructure that only triggers verification as a
result of a physical read IO. We can do that easily enough via the
existing b_iodone callback infrastructure. This callback is
currently only used by writes, and callbacks clear themselves from
the buffer b_iodone function pointer once they are run. By following
this same usage pattern, we can attach a verifier callback to the
buffer when it is first read from disk and clear it from the
b_iodone callback once it has been executed, preserving the existing
behaviour for buffers that are cached in memory.

To do this, we nee dto add a verifier function to all the buffer
read functions that can be attached to the buffer if we are going to
execute a physical read to fill the buffer. The iodone callback is
only passed the buffer, so the only context for verification we have
is the function being called.

Hence the initial verifier functions simply check the buffer for
valid contents according to the type that is expected in the buffer.
In future, more targetted verifiers could be implmented to verify
that buffers are in certain states or with certain constraints, but
that is not a focus of this patch set.

If a verifier function detects an inconsistency or corruption, the
only way it can pass that error to waiters is via placing an error
on the buffer itself via xfs_buf_ioerror(). A validation error
should set the error to EFSCORRUPTED, so that a validation error can
be distinguished from an IO failure, which will result in an EIO
being set on the buffer. Once processing is complete, the iodone
function is cleared and the next stage of ioend processing is
triggered by calling xfs_buf_ioend(). This is typically done like
this:

void verifier_fn(struct xfs_buf *bp)
{
	// check buffer

	if (!buf_ok) {
		xfs_error_report();
		xfs_buf_ioerror(bp, EFSCORRUPTED);
	}

	bp->b_iodone = NULL;
	xfs_buf_ioend(bp);
}

Hence callers that are returned a buffer need to check the buffer
for a validation error before using it. If special error handling
for a validation error is necessary, it needs to catch a
EFSCORRUPTED error. In most cases (e.g.  xfs_trans_read_buf_map())
this checking is already done, so there's relatively few places that
need modifications to their error handling to handle this.

The verifiers still emit error reports with stack traces, but they
are probably less useful than they were because the stack trace will
simply point to the IO completion stack. It is an open question as
to whether the error report should be in the verifier or issued by
the waiting context - I'm happy to have reports in the waiting
context in the places where there isn't already an error report if
necessary.

The next step in this process (i.e. the next patch set) is to add a
pre-write callback to verify the contents of the buffer just before
it is issued to disk.  This will allow us to verify that detectable
in-memory corruption is not being propagated to disk, and will use
the same verifier function as the read code.  Once these verifiers
are in place, the infrastructure for enabling CRC validation of
metadata buffers will be in place.

These write verifiers will initially be identical to these read
verifiers, but once CRC verification and calculation is added, the
callbacks will be different but the verifier identical.

It should be noted that this patch set does not quite cover all
metadata types - remote attribute and symlink blocks are not
currently handled because there is no way to validate those buffers
are good or bad because all they contain is user data. Verifiers for
these types of metadata buffers will be added when CRC protection is
added to these types.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs