Re: Best practices for handling drive failures during a run?

Nick Neumann <nick@xxxxxxxxxxxxxxxx> · Fri, 9 Sep 2022 23:03:46 -0500

On Fri, Sep 9, 2022 at 10:56 PM Vincent Fu <vincent.fu@xxxxxxxxxxx> wrote:
> You could test your theory about max_retries by creating an NVMe fabrics
> loopback device backed by null_blk with error injection. Then try to access one
> of the bad blocks via the nvme device and see if the delay before fio sees
> the error depends on io_timeout and max_retries in the way that you expect.

Oooh, that sounds great. Thanks for the suggestion. I'll get to it
Monday if I don't find some time this weekend.

Coincidentally, one of the things I found googling was someone using
NVMe fabrics complaining that nvme_core/io_timeout and
nvme_core/max_retries were not being honored. It was from 2019 but
seemed relevant:
https://lore.kernel.org/all/EA2BFA4D4BAD49629F533A98F74DCE42@alyakaslap/T/#m26b5c91ec59de5159961a26a6cb0340c32a05ec9
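
Before running the experiment it may help to record what the kernel
currently reports for the two parameters. The following is only a rough
sketch: it assumes the values are exposed under
/sys/module/nvme_core/parameters, and the helper names
(read_nvme_core_params, naive_worst_case_delay) are made up for
illustration. The delay formula is a naive guess (every attempt runs
into the full timeout), not something confirmed in this thread, and it
is exactly the kind of expectation the loopback test would check.

```python
from pathlib import Path

# Assumed location of the nvme_core module parameters on a Linux host.
NVME_CORE_PARAMS = Path("/sys/module/nvme_core/parameters")


def read_nvme_core_params(base=NVME_CORE_PARAMS,
                          names=("io_timeout", "max_retries")):
    """Return {name: int value} for each parameter file found under base."""
    values = {}
    for name in names:
        path = Path(base) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


def naive_worst_case_delay(io_timeout, max_retries):
    """Hypothetical bound in seconds: the first attempt plus every retry
    each take the full io_timeout. Ignores retry delays and any
    fabrics-specific timeouts."""
    return io_timeout * (max_retries + 1)


if __name__ == "__main__":
    params = read_nvme_core_params()
    print(params)
    if len(params) == 2:
        print("naive worst case (s):", naive_worst_case_delay(**params))
```

Comparing that number against the delay fio actually observes would
show whether the parameters are being honored on the fabrics path.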

I'll report back with what I see.

Thanks,
Nick