> -----Original Message-----
> From: Nick Neumann [mailto:nick@xxxxxxxxxxxxxxxx]
> Sent: Friday, September 9, 2022 1:36 PM
> To: Vincent Fu <vincent.fu@xxxxxxxxxxx>
> Subject: Re: Best practices for handling drive failures during a run?
>
> On Thu, Sep 8, 2022 at 9:59 AM Vincent Fu <vincent.fu@xxxxxxxxxxx> wrote:
> > The null_blk device supports error injection via the badblocks configfs
> > variable. So you could use it for testing. There is a help guide for
> > setting up null_blk devices via configfs at
> > https://zonedstorage.io/docs/getting-started/nullblk
>
> So this was really nice to learn about and pretty easy to use.
> (Although I will say I saw all kinds of weird behavior, like the device
> claiming it didn't support O_DIRECT, which I believe was due to making
> configfs changes while the device was powered on.)
>
> With it, fio did return immediately after the error, return an error
> code, print error messages above the JSON output, and set error to 5
> in the JSON for the job.
>
> Unfortunately, the same did not happen with the drive hang/abort/reset
> I hit, which must mean no I/O error was actually returned to fio.
> Checking the fio latency log, that last read reported a latency of
> 63.6 seconds.
>
> I'm guessing fio sat in wait_for_completion all of this time. For some
> reason the drive's behavior wasn't enough to cause an I/O error;
> perhaps it would have been eventually.
>
> Any other thoughts on why the OS was willing to let this read go for
> so long without an I/O error? I verified that
> /sys/module/nvme_core/parameters/io_timeout is 30, but
> /sys/module/nvme_core/parameters/max_retries is 5, so maybe that is
> the issue.
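For reference, here is a minimal sketch of the null_blk setup the guide describes; it is not from the original thread, and the device name, size, and sector range are illustrative (see Documentation/block/null_blk.rst in the kernel tree for the attribute list):

```shell
# Load the driver without auto-creating devices; requires root.
modprobe null_blk nr_devices=0

mkdir /sys/kernel/config/nullb/nullb0
cd /sys/kernel/config/nullb/nullb0

# Configure everything *before* powering on; changing attributes on a
# powered-on device is a likely source of the odd behavior noted above.
echo 1024 > size                 # device size in MB
echo 512 > blocksize             # logical block size in bytes
echo 1 > memory_backed           # back the device with RAM
echo "+1000-1007" > badblocks    # inject errors for sectors 1000-1007
echo 1 > power                   # device comes up as /dev/nullb0
```

I/O touching the bad range should then complete with EIO, which matches the error=5 fio reported in its JSON output.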
>
> Thanks,
> Nick

You could test your theory about max_retries by creating an NVMe fabrics
loopback device backed by null_blk with error injection. Then try to
access one of the bad blocks via the nvme device and see if the delay
before fio sees the error depends on io_timeout and max_retries in the
way that you expect.

I'm cc'ing the list on this reply in case anyone else wants to chime in.

Vincent
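A sketch of that loopback setup, assuming the nvmet and nvme-loop modules and nvme-cli are available; the subsystem NQN "testnqn" is illustrative, and /dev/nullb0 is assumed to be a null_blk device already configured with bad blocks:

```shell
# Export the null_blk device through an NVMe-over-Fabrics loopback target
# so that errors traverse the NVMe error-handling path. Requires root.
modprobe nvmet
modprobe nvme-loop

# Create a subsystem with the null_blk device as namespace 1.
mkdir /sys/kernel/config/nvmet/subsystems/testnqn
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
echo /dev/nullb0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable

# Create a loop port and link the subsystem to it.
mkdir /sys/kernel/config/nvmet/ports/1
echo loop > /sys/kernel/config/nvmet/ports/1/addr_trtype
ln -s /sys/kernel/config/nvmet/subsystems/testnqn \
      /sys/kernel/config/nvmet/ports/1/subsystems/testnqn

# Connect; the namespace appears as a local /dev/nvme*n* block device.
nvme connect -t loop -n testnqn

# Then time a read of a bad block and compare against the timeout knobs.
cat /sys/module/nvme_core/parameters/io_timeout
cat /sys/module/nvme_core/parameters/max_retries
```

Varying io_timeout and max_retries between runs should show whether they account for the 63.6-second latency you observed.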