> -----Original Message-----
> From: Nick Neumann [mailto:nick@xxxxxxxxxxxxxxxx]
> Sent: Friday, September 9, 2022 1:36 PM
> To: Vincent Fu <vincent.fu@xxxxxxxxxxx>
> Subject: Re: Best practices for handling drive failures during a run?
>
> On Thu, Sep 8, 2022 at 9:59 AM Vincent Fu <vincent.fu@xxxxxxxxxxx> wrote:
> > The null_blk device supports error injection via the badblocks configfs
> > variable. So you could use it for testing. There is a help guide for
> > setting up null_blk devices via configfs at
> > https://zonedstorage.io/docs/getting-started/nullblk
>
> So this was really nice to learn about and pretty easy to use.
> (Although I will say I saw all kinds of weird behavior, like the device
> claiming it didn't support O_DIRECT, which I believe was due to making
> configfs changes while the device was powered on.)
>
> With it, fio did return immediately after the error, return an error
> code, print error messages above the JSON output, and set error to 5
> in the JSON for the job.
>
> Unfortunately, the same did not happen with the drive hang/abort/reset
> I hit, which must mean no I/O error was actually returned to fio.
> Checking the fio latency log, that last read reported a latency of
> 63.6 seconds.
>
> I'm guessing fio sat in wait_for_completion all of this time. For some
> reason the drive's behavior wasn't enough to cause an I/O error;
> perhaps it would have been eventually.
>
> Any other thoughts on why the OS was willing to let this read go for
> so long without an I/O error? I verified that
> /sys/module/nvme_core/parameters/io_timeout is 30, but
> /sys/module/nvme_core/parameters/max_retries is 5, so maybe that is
> the issue.
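For reference, here is a minimal sketch of the null_blk setup the guide describes; it is not from the original thread, and the device name, size, and sector range are illustrative (see Documentation/block/null_blk.rst in the kernel tree for the attribute list):

```shell
# Load the driver without auto-creating devices; requires root.
modprobe null_blk nr_devices=0

mkdir /sys/kernel/config/nullb/nullb0
cd /sys/kernel/config/nullb/nullb0

# Configure everything *before* powering on; changing attributes on a
# powered-on device is a likely source of the odd behavior noted above.
echo 1024 > size                 # device size in MB
echo 512 > blocksize             # logical block size in bytes
echo 1 > memory_backed           # back the device with RAM
echo "+1000-1007" > badblocks    # inject errors for sectors 1000-1007
echo 1 > power                   # device comes up as /dev/nullb0
```

I/O touching the bad range should then complete with EIO, which matches the error=5 fio reported in its JSON output.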
>
> Thanks,
> Nick

You could test your theory about max_retries by creating an NVMe fabrics
loopback device backed by null_blk with error injection. Then try to
access one of the bad blocks via the nvme device and see if the delay
before fio sees the error depends on io_timeout and max_retries in the
way that you expect.

I'm cc'ing the list on this reply in case anyone else wants to chime in.

Vincent
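A sketch of that loopback setup, assuming the nvmet and nvme-loop modules and nvme-cli are available; the subsystem NQN "testnqn" is illustrative, and /dev/nullb0 is assumed to be a null_blk device already configured with bad blocks:

```shell
# Export the null_blk device through an NVMe-over-Fabrics loopback target
# so that errors traverse the NVMe error-handling path. Requires root.
modprobe nvmet
modprobe nvme-loop

# Create a subsystem with the null_blk device as namespace 1.
mkdir /sys/kernel/config/nvmet/subsystems/testnqn
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
echo /dev/nullb0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable

# Create a loop port and link the subsystem to it.
mkdir /sys/kernel/config/nvmet/ports/1
echo loop > /sys/kernel/config/nvmet/ports/1/addr_trtype
ln -s /sys/kernel/config/nvmet/subsystems/testnqn \
      /sys/kernel/config/nvmet/ports/1/subsystems/testnqn

# Connect; the namespace appears as a local /dev/nvme*n* block device.
nvme connect -t loop -n testnqn

# Then time a read of a bad block and compare against the timeout knobs.
cat /sys/module/nvme_core/parameters/io_timeout
cat /sys/module/nvme_core/parameters/max_retries
```

Varying io_timeout and max_retries between runs should show whether they account for the 63.6-second latency you observed.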