On 2022/09/10 17:27, Damien Le Moal wrote:
> On 2022/09/09 1:24, Nick Neumann wrote:
>> On Wed, Sep 7, 2022 at 4:01 PM Damien Le Moal
>> <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
>>
>>> Unless you are using continue_on_error=io (or "all"), fio will stop if it
>>> sees an IO error, or at least the job that gets the IO error will stop.
>>> The IO error will come from the kernel when your drive stops responding
>>> (the IO times out, is failed, and the drive is reset in that case).
>>
>> Thanks for this info.
>>
>>> Which behavior? That fio stops? You can try continue_on_error=all and
>>> fio will not stop until it reaches the time or size limit, even if some
>>> IOs fail.
>>
>> I would like fio to fail and exit when the I/O error happens. I was
>> wondering about a way to set up a scenario where an artificial IO error
>> will occur, to make sure it does, if that makes sense.
>
> You can use write long to "destroy" sectors: you will get errors when
> attempting to read the affected sectors. But that is a really big hammer. A

Note: write long is for ATA drives only. It does not apply to NVMe.

> simpler solution is to use dm-flakey to create "soft" IO errors.

And Vincent also pointed out null_blk error injection. dm-flakey can go on
top of any block device.

>
>>
>>> The default IO timeout for the kernel is 30s. If your drive stops
>>> responding for more than that, IOs will be aborted and failed (the user
>>> sees an error) and the drive is reset.
>>
>> Hmm. I had 65 seconds between any I/O; it sounds like that would've
>> been enough to fail things, but fio returned immediately after that
>> 65-second delayed I/O, and with no error.
>
> The IO was likely retried.
>
>>
>> I also found the drive timeout error in syslog:
>> Sep 7 12:37:43 localhost kernel: [ 4354.600211] nvme nvme0: I/O 870
>> QID 4 timeout, aborting
>> Sep 7 12:37:43 localhost kernel: [ 4354.615429] nvme nvme0: Abort status: 0x0
>> Sep 7 12:38:15 localhost kernel: [ 4386.600297] nvme nvme0: I/O 870
>> QID 4 timeout, reset controller
>> Sep 7 12:38:17 localhost kernel: [ 4388.050831] nvme nvme0: 7/0/0
>> default/read/poll queues
>> Sep 7 12:38:18 localhost kernel: [ 4389.437287] nvme0n1: AHDI p1 p2 p4
>> Sep 7 12:38:18 localhost kernel: [ 4389.437347] nvme0n1: p2 start
>> 2240010287 is beyond EOD, truncated
>> Sep 7 12:38:18 localhost kernel: [ 4389.437350] nvme0n1: p4 start
>> 2472081425 is beyond EOD, truncated
>>
>> Combining the fio output and syslog, the chain of events appears to be:
>> 4332 seconds - drive IO stops
>> 4353 seconds - syslog entry for timeout/abort
>> 4386 seconds - syslog entry for timeout/reset
>> 4387 seconds - read completes and fio exits without error
>>
>> Thanks,
>> Nick
>

-- 
Damien Le Moal
Western Digital Research
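For reference, a minimal fio job illustrating the exit-on-error behaviour
discussed above could look like the sketch below. The filename and sizes are
placeholders, not values from this thread:

  ; minimal sketch -- filename is a placeholder test device
  [global]
  ioengine=libaio
  direct=1
  time_based=1
  runtime=120

  [error-test]
  filename=/dev/mapper/flaky-test
  rw=randread
  bs=4k
  iodepth=16
  ; default is none: exit as soon as the kernel fails an IO
  continue_on_error=none

With continue_on_error=none (the default) the job should stop and report the
error on the first failed IO; continue_on_error=all (or io) makes fio keep
going until the time or size limit is reached.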
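For completeness, on an ATA drive the write long "big hammer" can be driven
from hdparm's bad-sector helpers. This does not apply to the NVMe drive in
this thread, and the sector number and device name below are only examples:

  # deliberately corrupt one sector so reads of it fail (destructive!)
  hdparm --yes-i-know-what-i-am-doing --make-bad-sector 123456 /dev/sdX
  # repair it again afterwards
  hdparm --yes-i-know-what-i-am-doing --repair-sector 123456 /dev/sdX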
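A sketch of the dm-flakey approach, assuming /dev/nvme0n1 is the device under
test; the device name and the up/down intervals are arbitrary examples:

  # table format: <start> <len> flakey <dev> <offset> <up interval> <down interval>
  SIZE=$(blockdev --getsz /dev/nvme0n1)    # size in 512-byte sectors
  echo "0 $SIZE flakey /dev/nvme0n1 0 60 5" | dmsetup create flaky-test
  # run fio against /dev/mapper/flaky-test; during each 5s "down" window
  # IO to the mapping is failed
  dmsetup remove flaky-test                # tear down when finished

The flakey target also accepts optional features such as drop_writes or
error_writes if only writes should misbehave.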
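And a sketch of the null_blk error-injection route via configfs. Attribute
names are as found in recent kernels, and the sizes and sector range are
arbitrary examples:

  modprobe null_blk nr_devices=0           # create devices via configfs only
  mkdir /sys/kernel/config/nullb/nullb0
  echo 1024 > /sys/kernel/config/nullb/nullb0/size          # MB
  echo 1 > /sys/kernel/config/nullb/nullb0/memory_backed
  echo "+2048-4096" > /sys/kernel/config/nullb/nullb0/badblocks
  echo 1 > /sys/kernel/config/nullb/nullb0/power             # brings up /dev/nullb0

IOs touching the bad-block range on /dev/nullb0 should then complete with an
error, which is a convenient way to confirm that fio really does exit.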
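On the 30s timeout point, the block-layer timeout is visible per device and,
on kernels new enough to expose the attribute, tunable; nvme0n1 and the
10000 ms value below are just examples:

  cat /sys/block/nvme0n1/queue/io_timeout      # per-device timeout, in milliseconds
  echo 10000 > /sys/block/nvme0n1/queue/io_timeout
  # the NVMe driver default can also be set at module load time (seconds):
  #   modprobe nvme_core io_timeout=30

Lowering it only shortens the window before the abort/reset sequence seen in
the syslog above; it does not change whether the retried IO eventually
succeeds.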