Best practices for handling drive failures during a run?

Nick Neumann <nick@xxxxxxxxxxxxxxxx> · Wed, 7 Sep 2022 10:58:04 -0500

I was wondering if there were any recommendations/suggestions on
handling drive failures during a fio run. I hit one yesterday during a
60-second mixed-use test on an SSD. 51 seconds in, the drive basically
stopped responding. (A separate program that periodically calls
smartctl to get drive state also showed something was up, as data like
temperature was missing.)

At 107 seconds, a read completed, and fio exited.

It made me wonder what would have happened if the test had not been
time-limited - e.g., a full-drive write. Would it have just hung, waiting
forever? Or would the OS eventually get back to fio and tell it the
submitted operations have failed and fio would exit?

Any ideas on ways to test the behavior, or areas of the code to look at?

I'm basically looking for input on how to make sure fio does not hang
in such situations. And even better would be if I could get fio to
return an error if it does happen - I could see the controls for
reporting an error being configurable - e.g., if an operation doesn't
return for N seconds, stop the job and return an error. I'm happy to
work on implementing something to help with this, and wanted to see
where things currently stand and what others think about the general
issue.
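
To make the "N seconds" idea concrete, here is a minimal sketch of an
external watchdog that wraps the whole run. It is a stopgap outside fio,
not a per-operation timeout inside it. The helper name run_with_deadline,
its parameters, and the job file name job.fio are all hypothetical. It also
assumes that a process blocked in uninterruptible I/O may not exit even
after SIGKILL, so the wrapper stops waiting after a grace period rather
than hanging itself.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s, kill_grace_s=5.0):
    """Run cmd and return (returncode, timed_out).

    returncode is None if the process could not be reaped after the kill.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Normal case: the command finishes within its wall-clock budget.
        return proc.wait(timeout=deadline_s), False
    except subprocess.TimeoutExpired:
        print(f"no exit after {deadline_s}s, killing: {cmd}", file=sys.stderr)
        proc.kill()
        try:
            proc.wait(timeout=kill_grace_s)
        except subprocess.TimeoutExpired:
            # Possibly blocked in uninterruptible I/O; stop waiting so the
            # caller can still report the failure.
            pass
        return proc.poll(), True


# Hypothetical usage: give a 60-second job a 90-second wall-clock budget.
# rc, timed_out = run_with_deadline(["fio", "job.fio"], deadline_s=90)
```

This only bounds the total run time, so it does not help with jobs that
have no predictable duration, such as a full-drive write. That case would
still need stall detection based on I/O progress.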

Thanks,
Nick