Re: Best practices for handling drive failures during a run?

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Thu, 8 Sep 2022 06:01:08 +0900

On 9/8/22 00:58, Nick Neumann wrote:
> I was wondering if there were any recommendations/suggestions on
> handling drive failures during a fio run. I hit one yesterday with a
> 60 second mixed use test on an SSD. 51 seconds in, the drive basically
> stopped responding. (A separate program that periodically calls
> smartctl to get drive state also showed something was up, as data like
> temperature was missing.)
> 
> At 107 seconds, a read completed, and fio exited.
> 
> It made me wonder what would have happened if the test was not time
> limited - e.g., a full drive write. Would it have just hung, waiting
> forever? Or would the OS eventually get back to fio and tell it the
> submitted operations have failed and fio would exit?

Unless you are using continue_on_error=io (or "all"), fio will stop if it
sees an IO error, or at least the job that gets the IO error will stop.
The IO error will come from the kernel when your drive stops responding
(the IO times out and is failed, and the drive is reset in that case).

> 
> Any ideas on ways to test the behavior, or areas of the code to look at?

Which behavior? That fio stops? You can try continue_on_error=all and
fio will not stop until it reaches the time or size limit, even if some
IOs fail.
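
If you drive fio from a script, a rough sketch of such a run could look
like the following. The job name, the /dev/sdX placeholder and the helper
name build_fio_cmd are made up for illustration; the fio options
themselves (rw, direct, time_based, runtime, continue_on_error) are
standard ones.

```python
from typing import List


def build_fio_cmd(device: str, runtime_s: int = 60) -> List[str]:
    """Build a fio command line for a time-limited mixed random read/write job.

    The job keeps running after IO errors because of continue_on_error=all.
    WARNING: pointing this at a raw device overwrites its contents.
    """
    return [
        "fio",
        "--name=mixed",               # hypothetical job name
        f"--filename={device}",       # e.g. /dev/sdX (placeholder)
        "--rw=randrw",                # mixed random reads and writes
        "--direct=1",                 # bypass the page cache
        "--time_based",               # run for the full runtime
        f"--runtime={runtime_s}",
        "--continue_on_error=all",    # do not stop the job on IO errors
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is written by accident.
    print(" ".join(build_fio_cmd("/dev/sdX")))
```

Dropping the continue_on_error line gives you the default behavior, where
the job stops on the first IO error it sees.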

> 
> I'm basically looking for input on how to make sure fio does not hang
> in such situations. And even better would be if I could get fio to
> return an error if it does happen - I could see the controls for
> reporting error being configurable - e.g., if an operation doesn't
> return for N seconds, stop the job and return an error. I'm happy to
> work on implementing stuff to help with this, and wanted to see where
> things currently are at and what others thought about the general
> issue.

The default IO timeout for the kernel is 30s. If your drive stops
responding for longer than that, the outstanding IOs are aborted and
failed (the user sees an error) and the drive is reset.
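
To check what timeout actually applies to a given drive, something like
the sketch below should work. It assumes a kernel recent enough to expose
the queue/io_timeout attribute (in milliseconds) under /sys/block; the
helper name io_timeout_seconds and the sysfs_root parameter are only
there for illustration and testing. On SCSI/ATA disks, the
device/timeout attribute should report the same value in seconds.

```python
from pathlib import Path
from typing import Optional


def io_timeout_seconds(dev: str, sysfs_root: str = "/sys") -> Optional[float]:
    """Return the block-layer request timeout for `dev` in seconds.

    Reads <sysfs_root>/block/<dev>/queue/io_timeout, which holds the
    timeout in milliseconds. Returns None if the attribute is missing
    or cannot be parsed.
    """
    path = Path(sysfs_root) / "block" / dev / "queue" / "io_timeout"
    try:
        return int(path.read_text().strip()) / 1000.0
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # "sda" is a placeholder device name.
    print(io_timeout_seconds("sda"))
```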

> 
> Thanks,
> Nick

-- 
Damien Le Moal
Western Digital Research