Re: Best practices for handling drive failures during a run?

On Wed, Sep 7, 2022 at 4:01 PM Damien Le Moal
<damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:

> Unless you are using continue_on_error=io (or "all"), fio will stop if it
> sees an IO error, or at least the job that gets the IO error will stop.
> The IO error will come from the kernel when your drive stops responding
> (IO timeout and is failed and the drive is reset in that case).

Thanks for this info.

> Which behavior ? That fio stops ? You can try continue_on_error=none and
> fio will not stop until it reaches the time or size limit, even if some
> IOs fail.

I would like fio to fail and exit when the I/O error happens. I was
wondering whether there is a way to set up a scenario in which an
artificial I/O error occurs, so I can confirm that it does.
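One possibility I have not tested: device-mapper has an "error" target
that fails every I/O sent to it, so a mapping whose first part is a normal
"linear" target and whose tail is an "error" target should make reads past
a chosen sector fail. The sketch below only builds the table text; the
helper name and the sector counts are made up for illustration, and
/dev/nvme0n1 is simply the device from my logs.

```python
def build_error_table(device: str, good_sectors: int, bad_sectors: int) -> str:
    """Return a device-mapper table: a working region, then a failing one.

    Hypothetical helper for illustration. The table uses the standard
    "<start> <length> <target> [args]" line format, with sizes in
    512-byte sectors.
    """
    if good_sectors <= 0 or bad_sectors <= 0:
        raise ValueError("sector counts must be positive")
    lines = [
        # Sectors [0, good_sectors) pass straight through to the real device.
        f"0 {good_sectors} linear {device} 0",
        # Sectors after that are served by the "error" target, which fails all I/O.
        f"{good_sectors} {bad_sectors} error",
    ]
    return "\n".join(lines)


# Example values only: 2048 good sectors followed by 2048 failing sectors.
table = build_error_table("/dev/nvme0n1", 2048, 2048)
print(table)
```

The printed table could then be fed to dmsetup create on stdin, and fio
pointed at the resulting /dev/mapper device. Note this produces an
immediate I/O error rather than a timeout, so it would only check fio's
exit behavior, not the kernel timeout path.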

> The default IO timeout for the kernel is 30s. If your drive stops
> responding for more than that, IOs will be aborted and failed (the user
> sees an error) and drive reset.

Hmm. I had a gap of 65 seconds with no I/O completing; it sounds like that
should have been enough to fail things, but fio returned immediately after
the I/O that was delayed by 65 seconds, and with no error.

I also found the drive timeout error in syslog:
Sep  7 12:37:43 localhost kernel: [ 4354.600211] nvme nvme0: I/O 870 QID 4 timeout, aborting
Sep  7 12:37:43 localhost kernel: [ 4354.615429] nvme nvme0: Abort status: 0x0
Sep  7 12:38:15 localhost kernel: [ 4386.600297] nvme nvme0: I/O 870 QID 4 timeout, reset controller
Sep  7 12:38:17 localhost kernel: [ 4388.050831] nvme nvme0: 7/0/0 default/read/poll queues
Sep  7 12:38:18 localhost kernel: [ 4389.437287]  nvme0n1: AHDI p1 p2 p4
Sep  7 12:38:18 localhost kernel: [ 4389.437347] nvme0n1: p2 start 2240010287 is beyond EOD, truncated
Sep  7 12:38:18 localhost kernel: [ 4389.437350] nvme0n1: p4 start 2472081425 is beyond EOD, truncated

Combining the fio output and the syslog, the chain of events appears to be:
4332 seconds - drive IO stops
4353 seconds - syslog entry for timeout/abort
4386 seconds - syslog entry for timeout/reset
4387 seconds - read completes and fio exits without error
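
For anyone who wants to check the spacing of the two timeout events, here
is a small sketch that pulls the kernel timestamps out of the syslog lines
quoted above (with the mail line-wrapping undone) and prints the gap
between the abort and the controller reset. The function name is made up
for illustration, and the regex only matches the exact message format seen
in these logs.

```python
import re

# The two nvme timeout messages from the syslog excerpt, unwrapped.
LOG = """\
Sep  7 12:37:43 localhost kernel: [ 4354.600211] nvme nvme0: I/O 870 QID 4 timeout, aborting
Sep  7 12:37:43 localhost kernel: [ 4354.615429] nvme nvme0: Abort status: 0x0
Sep  7 12:38:15 localhost kernel: [ 4386.600297] nvme nvme0: I/O 870 QID 4 timeout, reset controller
"""

# Captures the kernel uptime stamp and the action taken after the timeout.
TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+nvme nvme0: I/O \d+ QID \d+ timeout, (.+)$"
)


def timeout_events(log: str) -> list[tuple[float, str]]:
    """Return (kernel_uptime_seconds, action) for each timeout message."""
    events = []
    for line in log.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


events = timeout_events(LOG)
for stamp, action in events:
    print(f"{stamp:.6f}s  {action}")

# Time between the first timeout (abort) and the second (controller reset).
gap = events[1][0] - events[0][0]
print(f"abort -> reset: {gap:.3f}s")
```

With these lines the abort and the reset come out roughly 32 seconds
apart.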

Thanks,
Nick


