On Thu, Nov 21, 2019 at 04:09:33PM -0800, Darrick J. Wong wrote: > > +static inline int ext4_simulate_fail(struct super_block *sb, > > + unsigned long flag) > > Nit: bool? Sure, I'll do this for the next version. > If I'm reading this correctly, this means that userspace sets a > s_simulate_fail bit via sysfs knob, and the next time the filesystem > calls ext4_simulate_fail with the same bit set in @flag we'll return > true to say "simulate the failure" and clear the bit in s_simulate_fail? > > IOWs, the simulated failures have to be re-armed every time? Yes, that's correct. > Seems reasonable, but consider the possibility that in the future it > might be useful if you could set up periodic failures (e.g. directory > lookups fail 10% of the time) so that you can see how something like > fsstress reacts to less-predictable failures? So in theory, we could do that with dm_flakey --- but that's a pain in the tuckus, since you have to specify the LBA for the directory blocks that you might want to have fail. I implemented this so I could have a quick and dirty way of testing the first patch in this series (and in fact, I found a bug in the first version of the previous patch, so I'm glad I spent the time to implement the test patch :-). What might be interesting to do is some kind of eBPF hook where we pass in the block #, inode #, and metadata type, and the ePBF program could do use a much more complex set of criteria in terms of whether or not to trigger an EIO, or how to fuzz a particular block to either force a CRC failure, or to try to find bugs ala Hydra[1] (funded via a Google Faculty Research Award grant), but using a much more glass-box style test approach. [1] https://gts3.org/~sanidhya/pubs/2019/hydra.pdf This would be a lot more work, and I'm not sufficiently up to speed with eBPF, and I just needed a quick and dirty testing scheme. The reason why I think it's worthwhile to land this patch (as opposed to throwing it away after doing the development work for the previous patch) is that it's a relatively small set of changes, and all of the code disappears if CONFIG_DEBUG_EXT4 is not enabled. So it has no performance cost on production kernels, and it's highly unlikely that users would have a reason to use this feature on production use cases, so ripping this out if and when we have a more functional eBPF testing infrastructure to replace it shouldn't really be a problem. - Ted P.S. A fascinating question is whether we could make the hooks for this hypothetical eBPF hook general enough that it could work for more than just ext4, but for other file systems. The problem is that the fs metadata types are not going to be same across different file systems, so that makes the API design quite tricky; and perhaps not worth it?