On Mon, Jul 22, 2013 at 08:38:33PM -0700, Darrick J. Wong wrote:
> On Fri, Jul 19, 2013 at 04:55:52PM -0700, Darrick J. Wong wrote:
> > When we notice a block-bitmap corruption (because of device failure or
> > something else), we should mark this group as corrupt and prevent
> > further block allocations/deallocations from it.  Currently, we end up
> > generating one error message for every block in the bitmap.  This could
> > potentially make the system unstable, as noticed in some bugs.  With
> > this patch, the error will be printed only the first time, and the
> > entire block group will be marked as corrupted.  This prevents future
> > allocations/deallocations from it.

Thanks, applied....

> Hmm.  I think we need to have ext4_count_free_clusters() act as though
> corrupt block groups have "zero" free blocks so that mballoc will pass
> the -ENOSPC errors back to the upper layers.  Afaict, if one doesn't do
> this, ext4 encounters the situation where marking the blocks in use
> fails, yet the fs thinks there are free blocks still and ... leaves the
> pages dirty forever, instead of simply failing.

Yes, that's something we should probably add to make things more robust
in the case where we have huge numbers of corruptions (or hardware
failures) in the block bitmaps.

> Just trying this really quickly, if I blast /all/ the block groups, I
> see unstoppable errors in dmesg.

What sort of errors did you end up seeing?

> The other thing I noticed is that if one turns delalloc mode on,
> performs a live corruption of the bg descriptors, and then dd's a big
> file to the fs, there's no error reported back to userspace either in
> write(), sync(), or even umount().  Meanwhile, dmesg is getting hit
> with tons of corrupted-bitmap errors.

I'm not sure there's much we can do about this.  On the other hand, how
realistic of a threat is this?  If it's happening randomly, how likely
is it to happen?  And if it's a deliberate corruption, the attacker can
probably do a lot worse.
In practice, these weren't things we were really worried about back when
we were primarily worried about hardware failures, since hardware
failures are random; if the errors are affecting a large number of block
bitmaps, the storage device is probably completely toasted and there's
nothing we can do about it anyway.

When metadata checksums are enabled, this gets trickier, since it's
possible for a large number of metadata checksums to be corrupted in the
bg descriptors, especially if the bg descriptors get written to while
the file system is mounted.  This will smash a huge number of checksums,
and then badness will happen.  But realistically, bad things would
happen if that happened while the file system was mounted even without
checksums being enabled.

It may be that the best thing we can do is some kind of rate limiting of
the log messages, or some kind of heuristic where, if a sufficient
number of different checksums are found to be broken, we take much more
drastic action, such as unconditionally shutting down the file system.
The main issue here is that errors=continue is used if we want to do
some amount of recovery after certain types of file system corruption,
but what we really need is a mode where we can continue after certain
types of fs errors (especially if userspace is doing its own data block
checksums and has its own recovery mechanisms at the cluster file system
level).  But if things get really, really bad, we shouldn't try to bull
ahead in the face of errors when it's clear it's going to be
counterproductive.

> More for me to ponder....

Indeed...

						- Ted