On Wed, Nov 09, 2022 at 12:58:55PM +0800, Alexander Hartner wrote:
> We have been dealing with a problem where an NVMe drive fails every
> so often. More often than it really should. While we are trying to
> make sense of the hardware issue, we are also looking at the
> recovery options.
>
> Currently we are using Ubuntu 20.04 LTS on XFS with a single NVMe
> disk. If the disk fails the following error is reported:
>
> Nov 6, 2022 @ 20:27:12.000 [1095930.104279] nvme nvme0: controller
> is down; will reset: CSTS=0x3, PCI_STATUS=0x10
> Nov 6, 2022 @ 20:27:12.000 [1095930.451711] nvme nvme0: 64/0/0
> default/read/poll queues
> Nov 6, 2022 @ 20:27:12.000 [1095930.453846] blk_update_request: I/O
> error, dev nvme0n1, sector 34503744 op 0x1:(WRITE) flags 0x800
> phys_seg 1 prio class 0
>
> And the system becomes completely unresponsive.

What is the system stuck on? The output of sysrq-w would help us
understand what is happening as a result of this failed NVMe drive.
(See the P.S. below for one way to capture it.)

> I am looking for a solution to stop the system when this happens, so
> the other nodes in our cluster can take over the work. However,
> since the system is unresponsive and the disk is presumably in
> read-only mode, we are stuck in a sort of zombie state, where the
> processes are still running but don't have access to the disk. On
> ext3/4 there is an option to take the system down.

On XFS, there are some configurable error behaviours that can be
changed under /sys/fs/xfs/<dev>/error/metadata. See the "Error
handling" section of the Linux kernel admin guide's XFS page:

https://docs.kernel.org/admin-guide/xfs.html#error-handling

I'm guessing that the behaviour you are seeing is that metadata write
EIO errors default to "retry until unmount" behaviour (i.e. retry
writes forever, fail_at_unmount = true). There's a rough example of
changing that in the P.P.S. below.

> Is there an equivalent for XFS? I didn't find anything similar on
> the XFS man page.

Hmmmm. It might be worth documenting this sysfs stuff in xfs(5), not
just the mount options supported...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
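
P.S. One way to grab that sysrq-w output, assuming the sysrq
interface is available and you can still get a shell on the box:

    # enable sysrq if it isn't already
    sysctl -w kernel.sysrq=1
    # dump all tasks stuck in uninterruptible (D) state to the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200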
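
P.P.S. If the goal is to have XFS shut the filesystem down on the
first metadata write error rather than retrying it forever, something
along these lines should do it. This is an untested sketch and it
assumes the filesystem sits on nvme0n1 - adjust the device name to
whatever shows up under /sys/fs/xfs/ on your system:

    # 0 == fail immediately instead of retrying the failed metadata write
    echo 0 > /sys/fs/xfs/nvme0n1/error/metadata/EIO/max_retries
    echo 0 > /sys/fs/xfs/nvme0n1/error/metadata/EIO/retry_timeout_seconds

Once the error propagates, the filesystem shuts down, which should
give your cluster manager something concrete to fail over on.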