Detecting disk failures on XFS

We have been dealing with a problem where an NVMe drive fails every so
often, more often than it really should. While we are trying to make sense
of the hardware issue, we are also looking at the recovery options.

Currently we are using Ubuntu 20.04 LTS on XFS with a single NVMe
disk. When the disk fails, the following errors are reported:

Nov 6, 2022 @ 20:27:12.000    [1095930.104279] nvme nvme0: controller
is down; will reset: CSTS=0x3, PCI_STATUS=0x10
Nov 6, 2022 @ 20:27:12.000    [1095930.451711] nvme nvme0: 64/0/0
default/read/poll queues
Nov 6, 2022 @ 20:27:12.000    [1095930.453846] blk_update_request: I/O
error, dev nvme0n1, sector 34503744 op 0x1:(WRITE) flags 0x800
phys_seg 1 prio class 0

And the system becomes completely unresponsive.

I am looking for a way to stop the system when this happens, so that
the other nodes in our cluster can take over the work. However, since
the system is unresponsive and the disk is presumably in read-only
mode, we are stuck in a sort of zombie state, where the processes are
still running but no longer have access to the disk. On ext3/4 there
is a mount option to take the system down:

errors={continue|remount-ro|panic}
Define the behavior when an error is encountered.  (Either ignore
errors and just mark the filesystem erroneous and continue, or remount
the filesystem read-only, or panic and halt the system.)  The default
is set in the filesystem superblock, and can be changed using
tune2fs(8).
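
For reference, a minimal sketch of how we would configure that
behaviour on ext4 (/dev/sdX and /mnt/data are just placeholders):

# Override at mount time for a single mount:
mount -o remount,errors=panic /dev/sdX /mnt/data

# Or change the default stored in the superblock, as described above:
tune2fs -e panic /dev/sdX

# Or make it persistent via /etc/fstab:
/dev/sdX  /mnt/data  ext4  defaults,errors=panic  0  2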

Is there an equivalent for XFS? I didn't find anything similar in the
XFS man page.

Also, are there any other suggestions for handling this better?


