On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> >
> > Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
>
> I agree with Christoph that even if there was some explanation in the cover
> letter, there should be something at least as good in the patch itself. The
> cover letter is not saved, but the commit stays around forever, and should
> explain how this should be added to code, and how to use it from userspace.
>
>
> That said, I think this is a useful functionality. We have something similar
> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
> test a distributed filesystem, which is just a CPP macro with an unlikely()
> branch, while this looks more sophisticated. This looks like it has some
> added functionality like having more than one fault enabled at a time.
> If this lands we could likely switch our code over to using this.

This is pretty much what I was looking for. I just wanted to know if this patch
was interesting enough to anyone that I should spend more time on it or just
drop it :)

Agreed on documentation. I think it's also worth factoring out the
functionality for the ELF section trick that dynamic debug uses, too.

> Some things that are missing from this patch that is in our code:
>
> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>   - timeout: sleep for N msec to simulate network/disk/locking delays
>   - race: wait with one thread until a second thread hits matching check
>
> We also have a "fail_val" that allows making the check conditional (e.g.
> only operation on server "N" should fail, only RPC opcode "N", etc).

Those all sound like good ideas... fail_val especially. I think with that we'd
have all the functionality the existing fault injection framework has (which is
way too heavyweight to actually get used, imo)
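
To make the fail_val idea concrete, here is a rough userspace sketch of how a
check with "enabled", "oneshot" and a conditional value might look. Every name
in it (struct fault_point, FAULT_CHECK(), send_rpc(), ...) is made up for
illustration; this is neither the code from the patch nor Lustre's
OBD_FAIL_CHECK(). It assumes GCC or Clang for __builtin_expect(), is not
thread safe, and leaves out the timeout and race modes.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names throughout; not the patch's or Lustre's API. */
#define unlikely(x) __builtin_expect(!!(x), 0)

enum fault_mode { FAULT_OFF, FAULT_ENABLED, FAULT_ONESHOT };

struct fault_point {
	const char	*name;
	enum fault_mode	mode;
	bool		match_val;	/* only fire when val == fail_val */
	long		fail_val;
	unsigned long	hits;		/* number of times the fault fired */
};

/* Slow path: only reached when the fault point is switched on. */
static bool fault_should_fail(struct fault_point *fp, long val)
{
	if (fp->match_val && val != fp->fail_val)
		return false;

	/* A oneshot fault disarms itself the first time it fires. */
	if (fp->mode == FAULT_ONESHOT)
		fp->mode = FAULT_OFF;

	fp->hits++;
	return true;
}

/* Fast path: a single unlikely() branch while the fault is off. */
#define FAULT_CHECK(fp, val) \
	(unlikely((fp)->mode != FAULT_OFF) && fault_should_fail((fp), (val)))

/* Example call site: fail the RPC when the opcode matches fail_val. */
static int send_rpc(struct fault_point *fp, int opcode)
{
	if (FAULT_CHECK(fp, opcode))
		return -EIO;

	return 0;
}
```

With a fault point set to FAULT_ONESHOT, match_val = true and fail_val = 7,
send_rpc() would succeed for opcode 3, fail once with -EIO for opcode 7, and
succeed again on the next call with opcode 7.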