Re: Best practices for handling drive failures during a run?

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Sat, 10 Sep 2022 17:27:56 +0900

On 2022/09/09 1:24, Nick Neumann wrote:
> On Wed, Sep 7, 2022 at 4:01 PM Damien Le Moal
> <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
> 
>> Unless you are using continue_on_error=io (or "all"), fio will stop if it
>> sees an IO error, or at least the job that gets the IO error will stop.
>> The IO error will come from the kernel when your drive stops responding
>> (the IO times out and is failed, and the drive is reset in that case).
> 
> Thanks for this info.
> 
>> Which behavior? That fio stops? You can try continue_on_error=all and
>> fio will not stop until it reaches the time or size limit, even if some
>> IOs fail.
> 
> I would like fio to fail and exit when the I/O error happens. I was
> wondering about a way to set up a scenario where an artificial IO error
> will occur to make sure it does, if that makes sense.

You can use write-long to "destroy" sectors: reads of the affected sectors
will then fail with an error. But that is a really big hammer. A simpler
solution is to use dm-flakey, a device-mapper target that fails IOs during
configurable intervals, to create "soft" IO errors.
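
A minimal sketch of such a test, assuming a scratch block device that is not
mounted and that the script runs as root. The mapping name "flakey-test", the
10s up / 5s down intervals and the fio options are arbitrary choices for
illustration, not values from your setup. With no feature arguments,
dm-flakey fails all IOs during the down interval, and since no
continue_on_error is given, fio should stop on the first failed read and
exit with a non-zero status.

```python
import subprocess
import sys


def flakey_table(dev, sectors, up_s, down_s):
    """dm-flakey table line: IOs succeed for up_s seconds, then fail for down_s."""
    return f"0 {sectors} flakey {dev} 0 {up_s} {down_s}"


def fio_cmd(target, runtime_s):
    # continue_on_error is deliberately left unset, so fio keeps its
    # default behavior and stops the job on the first IO error.
    return ["fio", "--name=errtest", f"--filename={target}",
            "--rw=randread", "--bs=4k", "--direct=1",
            "--ioengine=libaio", "--iodepth=4",
            "--time_based", f"--runtime={runtime_s}"]


def main(dev):
    name = "flakey-test"  # hypothetical mapping name
    # blockdev --getsz reports the device size in 512-byte sectors.
    sectors = subprocess.run(["blockdev", "--getsz", dev], check=True,
                             capture_output=True, text=True).stdout.strip()
    subprocess.run(["dmsetup", "create", name, "--table",
                    flakey_table(dev, sectors, 10, 5)], check=True)
    try:
        rc = subprocess.run(fio_cmd(f"/dev/mapper/{name}", 60)).returncode
    finally:
        subprocess.run(["dmsetup", "remove", name], check=True)
    print("fio exit status:", rc)
    # The test passes only if fio reported a failure.
    return 0 if rc != 0 else 1


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Run it with the scratch device as the only argument. The job only reads, but
I would still not point it at a device holding data you care about.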

> 
>> The default IO timeout for the kernel is 30s. If your drive stops
>> responding for more than that, IOs will be aborted and failed (the user
>> sees an error) and the drive is reset.
> 
> Hmm. I had 65 seconds between any I/O; it sounds like that would've
> been enough to fail things, but fio returned immediately after that 65
> second delayed I/O, and with no error.

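The IO was likely retried: after the controller reset, the driver requeues
the commands that were in flight, so the read eventually completed
successfully and fio never saw an error.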
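
Whether a timed-out command ends up retried or failed depends on the nvme
driver settings, so it is worth checking them on your system. A small sketch,
assuming the nvme_core module exposes its io_timeout (in seconds) and
max_retries parameters under /sys/module, which is the case on the kernels I
know of. The helper name read_param is made up for this example.

```python
from pathlib import Path

# Assumed location of the nvme_core module parameters.
PARAMS = Path("/sys/module/nvme_core/parameters")


def read_param(name, base=PARAMS):
    """Return the integer value of a module parameter, or None if unreadable."""
    try:
        return int((base / name).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for name in ("io_timeout", "max_retries"):
        print(name, "=", read_param(name))
```

A value of None means the file is missing or unreadable, e.g. because
nvme_core is not loaded.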

> 
> I also found the drive timeout error in syslog:
> Sep  7 12:37:43 localhost kernel: [ 4354.600211] nvme nvme0: I/O 870 QID 4 timeout, aborting
> Sep  7 12:37:43 localhost kernel: [ 4354.615429] nvme nvme0: Abort status: 0x0
> Sep  7 12:38:15 localhost kernel: [ 4386.600297] nvme nvme0: I/O 870 QID 4 timeout, reset controller
> Sep  7 12:38:17 localhost kernel: [ 4388.050831] nvme nvme0: 7/0/0 default/read/poll queues
> Sep  7 12:38:18 localhost kernel: [ 4389.437287]  nvme0n1: AHDI p1 p2 p4
> Sep  7 12:38:18 localhost kernel: [ 4389.437347] nvme0n1: p2 start 2240010287 is beyond EOD, truncated
> Sep  7 12:38:18 localhost kernel: [ 4389.437350] nvme0n1: p4 start 2472081425 is beyond EOD, truncated
> 
> Combining the fio and syslog, the chain of events appears to be:
> 4332 seconds - drive IO stops
> 4354 seconds - syslog entry for timeout/abort
> 4386 seconds - syslog entry for timeout/reset
> 4387 seconds - read completes and fio exits without error
> 
> Thanks,
> Nick

-- 
Damien Le Moal
Western Digital Research