Re: Detecting disk failures on XFS

Carlos Maiolino <cem@xxxxxxxxxx> · Wed, 9 Nov 2022 12:22:44 +0100

On Wed, Nov 09, 2022 at 12:58:55PM +0800, Alexander Hartner wrote:
> We have been dealing with a problem where an NVMe drive fails every so
> often, more than it really should. While we are trying to make sense
> of the hardware issue, we are also looking at the recovery options.
> 
> Currently we are using Ubuntu 20.04 LTS on XFS with a single NVME
> disk. If the disk fails the following error is reported.
> 
> Nov 6, 2022 @ 20:27:12.000    [1095930.104279] nvme nvme0: controller
> is down; will reset: CSTS=0x3, PCI_STATUS=0x10
> Nov 6, 2022 @ 20:27:12.000    [1095930.451711] nvme nvme0: 64/0/0
> default/read/poll queues
> Nov 6, 2022 @ 20:27:12.000    [1095930.453846] blk_update_request: I/O
> error, dev nvme0n1, sector 34503744 op 0x1:(WRITE) flags 0x800
> phys_seg 1 prio class 0
> 
> And the system becomes completely unresponsive.
> 
> I am looking for a solution to stop the system when this happens, so
> the other nodes in our cluster can carry the work. However, since the
> system is unresponsive and the disk is presumably in read-only mode, we
> are stuck in a sort of zombie state, where the processes are still running
> but don't have access to the disk. On EXT3/4 there is an option to
> take the system down.
> 

XFS doesn't work like that. It will either shut down the filesystem or, in
the case of transient IO errors, keep retrying the IO while it waits for the
storage to come back. We don't keep a filesystem alive if it might be
inconsistent.

> Is there an equivalent for XFS? I didn't find anything similar on the
> XFS man page.
> 
> Also, any other suggestions to better handle this?

Look at the "Error handling" section of the kernel documentation:

https://docs.kernel.org/admin-guide/xfs.html

This might help, but I don't know how it translates to the distro kernel you
are using.
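
For reference, a rough sketch of what that section describes: per-filesystem
retry limits for metadata IO errors live under
/sys/fs/xfs/<dev>/error/metadata/<class>/ as max_retries and
retry_timeout_seconds. The helper name, its defaults, and the device name in
the usage note are only illustrative assumptions, and whether these files
exist at all depends on your kernel, so treat this as a starting point rather
than a tested recipe.

```python
import os

# Error classes listed in the kernel's XFS admin guide; a given distro
# kernel may expose a different set (assumption: check your own sysfs tree).
ERROR_CLASSES = ("default", "EIO", "ENOSPC", "ENODEV")


def set_metadata_error_policy(dev, error_class="EIO", max_retries=0,
                              retry_timeout_seconds=0,
                              sysfs_root="/sys/fs/xfs"):
    """Write retry limits for one metadata error class of a mounted XFS fs.

    Hypothetical helper, not part of any XFS tooling. Per the kernel docs,
    -1 means "retry forever" and 0 means "fail immediately" for both knobs.
    Returns a dict mapping each path written to the value written.
    """
    if error_class not in ERROR_CLASSES:
        raise ValueError(f"unknown error class: {error_class!r}")

    base = os.path.join(sysfs_root, dev, "error", "metadata", error_class)
    written = {}
    for name, value in (("max_retries", max_retries),
                        ("retry_timeout_seconds", retry_timeout_seconds)):
        path = os.path.join(base, name)
        # Writing to sysfs needs root; a missing file raises OSError,
        # which usually means this kernel does not expose the knob.
        with open(path, "w") as f:
            f.write(f"{value}\n")
        written[path] = value
    return written
```

Run as root, something like set_metadata_error_policy("nvme0n1", "EIO")
would ask XFS to stop retrying on EIO and fail straight away instead. The
name under /sys/fs/xfs/ is whatever block device the filesystem is mounted
from, so it may well be a partition rather than nvme0n1. Note that this only
gets the filesystem shut down sooner; stopping the node is still up to
whatever watches for that.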

Cheers.

-- 
Carlos Maiolino