Re: Best practices for handling drive failures during a run?

Nick Neumann <nick@xxxxxxxxxxxxxxxx> · Thu, 22 Sep 2022 18:11:13 -0500

On Sat, Sep 10, 2022 at 3:28 AM Damien Le Moal
<damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
> You can use write-long, to "destroy" sectors: you will get errors when
> attempting to read the affected sectors. But that is a really big hammer. A
> simpler solution is to use dm-flakey to create "soft" IO errors.

Thank you for mentioning this - I'm not a Linux veteran, so I did not
know about these tools.

I tried dm-flakey, but when the device is down, the errors are
returned immediately. I also looked at dm-delay, and that actually
worked pretty well for getting fio to sit and wait on an I/O.

Unfortunately, I am having a hard time making the delay "big". The
time it takes to add the delay rule appears to be a linear function of
the amount of delay, with a very large constant factor. A half second
delay takes 11 seconds to add, and a 5 second delay takes 112 seconds:
sudo time dmsetup create test9 --table "0 1024 delay /dev/nullb1 0 500 /dev/nullb1 0 0"
0.00user
0.00system
0:11.28elapsed
...
sudo time dmsetup create test10 --table "0 1024 delay /dev/nullb1 0 5000 /dev/nullb1 0 0"
0.00user
0.00system
1:52.70elapsed
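
For stepping through several delay values, the table line used above
can be generated rather than typed by hand. The following is only a
dry-run sketch: the helper name delay_table, the mapping names
delay_<N>ms, and the list of delays are made up for illustration, and
the loop prints the dmsetup commands instead of running them. It
assumes, as I read the dm-delay documentation, that in the six-argument
form the first device/offset/delay triple applies to reads and the
second to writes.

```shell
#!/bin/sh
# Build a dm-delay table line in the same shape as the commands above.
# Assumption: first triple = read path, second triple = write path.
#   $1 = backing device, $2 = length in 512-byte sectors, $3 = read delay in ms
delay_table() {
    printf '0 %s delay %s 0 %s %s 0 0\n' "$2" "$1" "$3" "$1"
}

# Dry run: print one dmsetup command per delay value (hypothetical names).
for ms in 500 1000 2000 5000; do
    echo "dmsetup create delay_${ms}ms --table \"$(delay_table /dev/nullb1 1024 "$ms")\""
done
```

Each printed line can then be prefixed with "sudo time" to compare how
long the setup takes at each step.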

Something also seems to break at larger values: my attempt to add a
70 second delay had not finished after 2 hours. I'm currently
experimenting to find a smaller, but still large, value that is useful
for testing the NVMe timeout/retry defaults. I've seen code snippets
online that set the delay to 100 seconds, though, so I'm at a loss as
to why the time to set it up grows so large on my system.
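
As a rough sanity check on the 70 second case, the two timings above
can be extrapolated linearly. This is only a sketch: it assumes the
setup time really is linear in the delay, which two data points cannot
confirm, and the function name estimate_setup_seconds is made up for
illustration.

```shell
#!/bin/sh
# Linear extrapolation from the two measured points:
#   500 ms delay  -> 11.28 s to set up
#   5000 ms delay -> 112.70 s to set up
# $1 = target delay in ms; prints the estimated setup time in whole seconds.
estimate_setup_seconds() {
    awk -v target="$1" 'BEGIN {
        d1 = 500;  t1 = 11.28
        d2 = 5000; t2 = 112.70
        slope = (t2 - t1) / (d2 - d1)   # seconds of setup per ms of delay
        printf "%.0f\n", t1 + slope * (target - d1)
    }'
}

estimate_setup_seconds 70000   # prints 1578
```

Under that assumption a 70 second delay should take roughly 26 minutes
to add, so a run that is still going after 2 hours is well beyond what
the linear trend predicts.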