Re: [PATCH 2/2] ext4: simulate various I/O and checksum errors when reading metadata

"Theodore Y. Ts'o" <tytso@xxxxxxx> · Thu, 21 Nov 2019 20:00:26 -0500

On Thu, Nov 21, 2019 at 04:09:33PM -0800, Darrick J. Wong wrote:
> > +static inline int ext4_simulate_fail(struct super_block *sb,
> > +				     unsigned long flag)
> 
> Nit: bool?

Sure, I'll do this for the next version.

> If I'm reading this correctly, this means that userspace sets a
> s_simulate_fail bit via sysfs knob, and the next time the filesystem
> calls ext4_simulate_fail with the same bit set in @flag we'll return
> true to say "simulate the failure" and clear the bit in s_simulate_fail?
> 
> IOWs, the simulated failures have to be re-armed every time?

Yes, that's correct.

> Seems reasonable, but consider the possibility that in the future it
> might be useful if you could set up periodic failures (e.g. directory
> lookups fail 10% of the time) so that you can see how something like
> fsstress reacts to less-predictable failures?

So in theory, we could do that with dm_flakey --- but that's a pain in
the tuckus, since you have to specify the LBA for the directory blocks
that you might want to have fail.  I implemented this so I could have
a quick and dirty way of testing the first patch in this series (and
in fact, I found a bug in the first version of the previous patch, so
I'm glad I spent the time to implement the test patch :-).

What might be interesting to do is some kind of eBPF hook where we
pass in the block #, inode #, and metadata type, and the ePBF program
could do use a much more complex set of criteria in terms of whether
or not to trigger an EIO, or how to fuzz a particular block to either
force a CRC failure, or to try to find bugs ala Hydra[1] (funded via a
Google Faculty Research Award grant), but using a much more glass-box
style test approach.

[1] https://gts3.org/~sanidhya/pubs/2019/hydra.pdf

This would be a lot more work, and I'm not sufficiently up to speed
with eBPF, and I just needed a quick and dirty testing scheme.

The reason why I think it's worthwhile to land this patch (as opposed
to throwing it away after doing the development work for the previous
patch) is that it's a relatively small set of changes, and all of the
code disappears if CONFIG_DEBUG_EXT4 is not enabled.  So it has no
performance cost on production kernels, and it's highly unlikely that
users would have a reason to use this feature on production use cases,
so ripping this out if and when we have a more functional eBPF testing
infrastructure to replace it shouldn't really be a problem.

					- Ted

P.S.  A fascinating question is whether we could make the hooks for
this hypothetical eBPF hook general enough that it could work for more
than just ext4, but for other file systems.  The problem is that the
fs metadata types are not going to be same across different file
systems, so that makes the API design quite tricky; and perhaps not
worth it?