Re: Best practices for handling drive failures during a run?

On Fri, Sep 9, 2022 at 10:56 PM Vincent Fu <vincent.fu@xxxxxxxxxxx> wrote:
> You could test your theory about max_retries by creating an NVMe fabrics
> loopback device backed by null_blk with error injection. Then try to access one
> of the bad blocks via the nvme device and see if the delay before fio sees
> the error depends on io_timeout and max_retries in the way that you expect.
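
For reference, I'm assuming the knobs you mean are the nvme_core module
parameters on the host side (plus the block layer's per-device timeout).
These are the paths I was planning to poke at, so please correct me if
I've got the wrong ones:

# Host-side timeout/retry knobs (as I understand them)
cat /sys/module/nvme_core/parameters/io_timeout    # per-I/O timeout, in seconds
cat /sys/module/nvme_core/parameters/max_retries   # retries before an I/O is failed
cat /sys/block/nvme0n1/queue/io_timeout            # block layer timeout in ms, once the fabrics device exists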

I finally got a chance to try this. I had to learn just enough about nvme
fabrics to combine it with the null_blk bad blocks setup from before. I
think I'm doing everything right, but I'm wondering if I missed something,
because writing to the nvme fabrics device that has the null_blk device
"backing" it behaves the same as writing to the null_blk device directly -
fio gets an immediate error and terminates. I wonder if that should really
surprise me, though, since the underlying device never experiences a
timeout, and its error is immediately propagated to the client over the
nvme fabric (apologies if I'm using any terminology wrong). This was my
basic setup:

sudo modprobe null_blk nr_devices=0
sudo mkdir /sys/kernel/config/nullb/nullb0
echo 1 | sudo tee -a /sys/kernel/config/nullb/nullb0/memory_backed
echo "+1-100" | sudo tee -a /sys/kernel/config/nullb/nullb0/badblocks
echo 1 | sudo tee -a /sys/kernel/config/nullb/nullb0/power
# First fio run directly on null device returns error immediately
sudo fio --filename=/dev/nullb0 --name=job --ioengine=libaio \
    --direct=1 --size=1M --rw=rw --rwmixwrite=100 --bs=128K
sudo modprobe nvme_tcp
sudo modprobe nvmet-tcp
sudo mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test
cd /sys/kernel/config/nvmet/subsystems/nvmet-test
echo 1 | sudo tee -a attr_allow_any_host
sudo mkdir namespaces/1
cd namespaces/1
echo -n /dev/nullb0 | sudo tee -a device_path
echo 1 | sudo tee -a enable
sudo mkdir /sys/kernel/config/nvmet/ports/1
echo 127.0.0.1 | sudo tee -a /sys/kernel/config/nvmet/ports/1/addr_traddr
echo tcp | sudo tee -a /sys/kernel/config/nvmet/ports/1/addr_trtype
echo 4420 | sudo tee -a /sys/kernel/config/nvmet/ports/1/addr_trsvcid
echo ipv4 | sudo tee -a /sys/kernel/config/nvmet/ports/1/addr_adrfam
sudo ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test \
    /sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test
sudo dmesg | grep nvmet_tcp
sudo modprobe nvme
sudo nvme discover -t tcp -a 127.0.0.1 -s 4420
sudo nvme connect -t tcp -n nvmet-test -a 127.0.0.1 -s 4420
sudo nvme list
cat /proc/partitions | grep nvme
# This guy also returns error immediately
sudo fio --filename=/dev/nvme0n1 --name=job --ioengine=libaio \
    --direct=1 --size=1M --rw=rw --rwmixwrite=100 --bs=128K
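
In case anyone wants to reproduce this, something along these lines should
tear the configuration back down afterwards (I haven't exhaustively tested
this cleanup, so treat it as a sketch rather than a vetted recipe):

sudo nvme disconnect -n nvmet-test
sudo rm /sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test
sudo rmdir /sys/kernel/config/nvmet/ports/1
echo 0 | sudo tee /sys/kernel/config/nvmet/subsystems/nvmet-test/namespaces/1/enable
sudo rmdir /sys/kernel/config/nvmet/subsystems/nvmet-test/namespaces/1
sudo rmdir /sys/kernel/config/nvmet/subsystems/nvmet-test
echo 0 | sudo tee /sys/kernel/config/nullb/nullb0/power
sudo rmdir /sys/kernel/config/nullb/nullb0
sudo modprobe -r nvme_tcp nvmet_tcp null_blk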


