Re: Best practices for handling drive failures during a run?

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Sat, 10 Sep 2022 17:37:11 +0900

On 2022/09/10 17:27, Damien Le Moal wrote:
> On 2022/09/09 1:24, Nick Neumann wrote:
>> On Wed, Sep 7, 2022 at 4:01 PM Damien Le Moal
>> <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
>>
>>> Unless you are using continue_on_error=io (or "all"), fio will stop if it
>>> sees an IO error, or at least the job that gets the IO error will stop.
>>> The IO error will come from the kernel when your drive stops responding
>>> (the IO times out and is failed, and the drive is reset in that case).
>>
>> Thanks for this info.
>>
>>> Which behavior? That fio stops? You can try continue_on_error=all and
>>> fio will not stop until it reaches the time or size limit, even if some
>>> IOs fail.
>>
>> I would like fio to fail and exit when the I/O error happens. I was
>> wondering about a way to set up a scenario where an artificial IO error
>> will occur to make sure it does, if that makes sense.
> 
> You can use write-long, to "destroy" sectors: you will get errors when
> attempting to read the affected sectors. But that is a really big hammer. A

Note: write long is for ATA drives only. It does not apply to NVMe.

> simpler solution is to use dm-flakey to create "soft" IO errors.

And Vincent also pointed out null_blk error injection. dm-flakey can go on top
of any block device.
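
For the dm-flakey route, here is a minimal sketch of the table line and a
matching fio command, written in Python so the strings are easy to check.
The device path /dev/sdX, the mapping name flakey-test, the job name and the
interval values are placeholders chosen for illustration, not values from
this thread. The length must be given in 512-byte sectors (for example from
"blockdev --getsz"). With no optional features, dm-flakey should fail IOs
during each down interval, so a job left at the default
continue_on_error=none should stop with an error.

```python
import shlex


def flakey_table(dev, sectors, up_s, down_s, offset=0):
    """Build a one-line device-mapper table for the dm-flakey target.

    Layout: <start> <length> flakey <dev> <offset> <up interval> <down interval>
    The device is usable for up_s seconds, then fails IOs for down_s seconds.
    """
    if sectors <= 0:
        raise ValueError("length in sectors must be positive")
    if offset < 0 or up_s < 0 or down_s < 0:
        raise ValueError("offset and intervals must not be negative")
    return f"0 {sectors} flakey {dev} {offset} {up_s} {down_s}"


def fio_argv(target, runtime_s=120):
    """Build a fio command line that should stop on the first IO error."""
    return [
        "fio",
        "--name=flakey-read",          # hypothetical job name
        f"--filename={target}",
        "--rw=randread",
        "--bs=4k",
        "--direct=1",
        "--time_based",
        f"--runtime={runtime_s}",
        "--continue_on_error=none",    # the default, spelled out for clarity
    ]


if __name__ == "__main__":
    # Placeholder device and size: substitute a scratch device you can lose.
    table = flakey_table("/dev/sdX", 2097152, up_s=30, down_s=5)
    print("dmsetup create flakey-test --table " + shlex.quote(table))
    print(shlex.join(fio_argv("/dev/mapper/flakey-test")))
```

Both printed commands need root. If fio behaves as described above, it
should exit with a non-zero status once the first down interval starts.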

> 
>>
>>> The default IO timeout for the kernel is 30s. If your drive stops
>>> responding for more than that, IOs will be aborted and failed (the user
>>> sees an error) and the drive is reset.
>>
>> Hmm. I had 65 seconds between any I/O; it sounds like that would've
>> been enough to fail things, but fio returned immediately after that 65
>> second delayed I/O, and with no error.
> 
> The IO was likely retried.
> 
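
To check which timeout was actually in effect, the values can be read from
sysfs. A small sketch, assuming a recent kernel where the per-queue
io_timeout attribute (in milliseconds) and the nvme_core io_timeout module
parameter (in seconds) are exposed at the paths below; the device name
nvme0n1 is taken from the log, and a missing file is reported as None.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def io_timeouts(dev, sysfs_root="/sys"):
    """Collect the IO timeout settings that may apply to a block device.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real system the default is what you want.
    """
    root = Path(sysfs_root)
    return {
        # Per-queue request timeout, in milliseconds.
        "queue_io_timeout_ms": read_int(
            root / "block" / dev / "queue" / "io_timeout"),
        # NVMe core module parameter, in seconds.
        "nvme_core_io_timeout_s": read_int(
            root / "module" / "nvme_core" / "parameters" / "io_timeout"),
    }


if __name__ == "__main__":
    print(io_timeouts("nvme0n1"))
```
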
>>
>> I also found the drive timeout error in syslog:
>> Sep  7 12:37:43 localhost kernel: [ 4354.600211] nvme nvme0: I/O 870
>> QID 4 timeout, aborting
>> Sep  7 12:37:43 localhost kernel: [ 4354.615429] nvme nvme0: Abort status: 0x0
>> Sep  7 12:38:15 localhost kernel: [ 4386.600297] nvme nvme0: I/O 870
>> QID 4 timeout, reset controller
>> Sep  7 12:38:17 localhost kernel: [ 4388.050831] nvme nvme0: 7/0/0
>> default/read/poll queues
>> Sep  7 12:38:18 localhost kernel: [ 4389.437287]  nvme0n1: AHDI p1 p2 p4
>> Sep  7 12:38:18 localhost kernel: [ 4389.437347] nvme0n1: p2 start
>> 2240010287 is beyond EOD, truncated
>> Sep  7 12:38:18 localhost kernel: [ 4389.437350] nvme0n1: p4 start
>> 2472081425 is beyond EOD, truncated
>>
>> Combining the fio and syslog, the chain of events appears to be:
>> 4332 seconds - drive IO stops
>> 4354 seconds - syslog entry for timeout/abort
>> 4386 seconds - syslog entry for timeout/reset
>> 4387 seconds - read completes and fio exits without error
>>
>> Thanks,
>> Nick
> 
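
The timeline can be cross-checked against the kernel timestamps directly.
A short sketch that pulls the bracketed timestamps out of the three nvme
timeout lines quoted above (rejoined here, since the mail wrapped them) and
prints the gap between the abort and the controller reset. The helper names
are mine. A gap of about 32 s would be consistent with the request timer
being re-armed after the abort, assuming the 30 s default timeout was in
effect.

```python
import re

# The three timeout-related lines from the quoted syslog, unwrapped.
LOG = """\
Sep  7 12:37:43 localhost kernel: [ 4354.600211] nvme nvme0: I/O 870 QID 4 timeout, aborting
Sep  7 12:37:43 localhost kernel: [ 4354.615429] nvme nvme0: Abort status: 0x0
Sep  7 12:38:15 localhost kernel: [ 4386.600297] nvme nvme0: I/O 870 QID 4 timeout, reset controller
"""

# Kernel timestamp in brackets, followed by the message text.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")


def events(text):
    """Return (seconds_since_boot, message) for each timestamped line."""
    out = []
    for line in text.splitlines():
        m = STAMP.search(line)
        if m:
            out.append((float(m.group(1)), m.group(2)))
    return out


def first_time(evts, needle):
    """Return the timestamp of the first message containing needle."""
    for t, msg in evts:
        if needle in msg:
            return t
    return None


if __name__ == "__main__":
    evts = events(LOG)
    abort = first_time(evts, "timeout, aborting")
    reset = first_time(evts, "timeout, reset controller")
    # prints: abort at 4354.6 s, reset at 4386.6 s, gap 32.0 s
    print(f"abort at {abort:.1f} s, reset at {reset:.1f} s, "
          f"gap {reset - abort:.1f} s")
```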

-- 
Damien Le Moal
Western Digital Research