On 2022/09/09 1:24, Nick Neumann wrote:
> On Wed, Sep 7, 2022 at 4:01 PM Damien Le Moal
> <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
>
>> Unless you are using continue_on_error=io (or "all"), fio will stop if it
>> sees an IO error, or at least the job that gets the IO error will stop.
>> The IO error will come from the kernel when your drive stops responding
>> (IO timeout and is failed and the drive is reset in that case).
>
> Thanks for this info.
>
>> Which behavior ? That fio stops ? You can try continue_on_error=none and
>> fio will not stop until it reaches the time or size limit, even if some
>> IOs fail.
>
> I would like fio to fail and exit when the I/O error happens. I was
> wondering about a way to setup a scenario where an artificial IO error
> will occur to make sure it does, if that makes sense.

You can use write-long, to "destroy" sectors: you will get errors when
attempting to read the affected sectors. But that is a really big hammer. A
simpler solution is to use dm-flakey to create "soft" IO errors.

>
>> The default IO timeout for the kernel is 30s. If your drive stops
>> responding for more than that, IOs will be aborted and failed (the user
>> sees an error) and drive reset.
>
> Hmm. I had 65 seconds between any I/O; it sounds like that would've
> been enough to fail things, but fio returned immediately after that 65
> second delayed I/O, and with no error.

The IO was likely retried.
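
For the dm-flakey suggestion above, a rough, untested sketch of what the
setup could look like. It only builds and prints the commands, it does not
run them. The table layout (<dev> <offset> <up interval> <down interval>)
is the one described in the kernel's dm-flakey documentation; with no
feature arguments, all IO to the mapped device returns errors during each
"down" interval. The device path, the mapping name "flakey-test", the
30s/5s intervals, the 2048-sector size and the fio options are all
assumptions for illustration. The real size in 512-byte sectors would come
from "blockdev --getsz <dev>".

```python
import shlex


def flakey_table(dev: str, size_sectors: int, up_s: int, down_s: int,
                 offset: int = 0) -> str:
    """Build a dm-flakey table line.

    Layout: <start> <length> flakey <dev> <offset> <up interval> <down interval>
    The device works normally for up_s seconds, then fails IO for down_s seconds.
    """
    if size_sectors <= 0 or up_s < 0 or down_s < 0 or offset < 0:
        raise ValueError("size must be positive; intervals and offset non-negative")
    return f"0 {size_sectors} flakey {dev} {offset} {up_s} {down_s}"


def setup_commands(dev: str, size_sectors: int, name: str = "flakey-test",
                   up_s: int = 30, down_s: int = 5) -> list[list[str]]:
    """Return the commands (as argv lists) to create, exercise and remove the mapping."""
    table = flakey_table(dev, size_sectors, up_s, down_s)
    return [
        # Create the flakey mapping on top of the real device.
        ["dmsetup", "create", name, "--table", table],
        # Read-only fio run, long enough to cross at least one "down" interval.
        # With the default continue_on_error=none, the job should stop on the
        # first failed IO.
        ["fio", "--name=errtest", f"--filename=/dev/mapper/{name}",
         "--rw=randread", "--direct=1", "--time_based", "--runtime=120"],
        # Tear the mapping down afterwards.
        ["dmsetup", "remove", name],
    ]


if __name__ == "__main__":
    # /dev/nvme0n1 and 2048 sectors are placeholders, not values from this thread.
    for cmd in setup_commands("/dev/nvme0n1", 2048):
        print(shlex.join(cmd))
```

The commands need root, and the fio job is kept read-only so that the
underlying device is not modified.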
>
> I also found the drive timeout error in syslog:
> Sep 7 12:37:43 localhost kernel: [ 4354.600211] nvme nvme0: I/O 870
> QID 4 timeout, aborting
> Sep 7 12:37:43 localhost kernel: [ 4354.615429] nvme nvme0: Abort status: 0x0
> Sep 7 12:38:15 localhost kernel: [ 4386.600297] nvme nvme0: I/O 870
> QID 4 timeout, reset controller
> Sep 7 12:38:17 localhost kernel: [ 4388.050831] nvme nvme0: 7/0/0
> default/read/poll queues
> Sep 7 12:38:18 localhost kernel: [ 4389.437287] nvme0n1: AHDI p1 p2 p4
> Sep 7 12:38:18 localhost kernel: [ 4389.437347] nvme0n1: p2 start
> 2240010287 is beyond EOD, truncated
> Sep 7 12:38:18 localhost kernel: [ 4389.437350] nvme0n1: p4 start
> 2472081425 is beyond EOD, truncated
>
> Combining the fio and syslog, the chain of events appears to be:
> 4332 seconds - drive IO stops
> 4353 seconds - syslog entry for timeout/abort
> 4386 seconds - syslog entry for timeout/reset
> 4387 seconds - read completes and fio exits without error
>
> Thanks,
> Nick

-- 
Damien Le Moal
Western Digital Research