On 09.06.2022 1:48, Roger Heflin wrote:
> You might want to check whether specific disk devices are getting
> reset/rebooted; the more often they are reset/rebooted, the higher
> the chance of data loss. The vendor's solution in the case I know
> about was to treat unrequested device resets/reboots as a sign of a
> failing device, and to disable and replace it.
How can these resets/reboots be detected? Is there a counter in the
kernel or in the NVMe device itself?
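
For what it's worth, here is a minimal sketch of what such a check
could look like. It assumes nvme-cli is installed; the
"unsafe_shutdowns" and "media_errors" field names follow its JSON
output, and the exact wording of the driver's reset messages varies
by kernel version:

#!/usr/bin/env python3
# Sketch: look for evidence of NVMe controller resets.
import json
import subprocess

def smart_counters(dev="/dev/nvme0"):
    # nvme-cli can emit the SMART/health log as JSON.
    out = subprocess.check_output(
        ["nvme", "smart-log", dev, "--output-format=json"])
    log = json.loads(out)
    # unsafe_shutdowns increments when the controller goes down
    # without a clean shutdown notification; media_errors counts
    # unrecovered data-integrity errors.
    return log.get("unsafe_shutdowns"), log.get("media_errors")

def reset_messages():
    # The NVMe driver logs lines such as
    # "nvme nvme0: resetting controller" (wording varies by kernel).
    out = subprocess.check_output(["dmesg"], text=True)
    return [l for l in out.splitlines()
            if "nvme" in l and "reset" in l.lower()]

if __name__ == "__main__":
    unsafe, media = smart_counters()
    print(f"unsafe_shutdowns={unsafe} media_errors={media}")
    for line in reset_messages():
        print(line)

Note that the SMART counters live on the device itself and survive
reboots, while dmesg only covers the current boot.
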
> I don't know if this is what is causing your issue or not, but it is
> a possible issue, and one that is hard to write code to handle.
We see log messages explicitly reporting an I/O error and data not being
written:
[Tue Jun 7 09:58:45 2022] I/O error, dev nvme0n1, sector 538918912 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Tue Jun 7 09:58:45 2022] I/O error, dev nvme0n1, sector 538988816 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Tue Jun 7 09:58:48 2022] I/O error, dev nvme0n1, sector 126839568 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Tue Jun 7 09:58:48 2022] I/O error, dev nvme0n1, sector 126888224 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Tue Jun 7 09:58:48 2022] I/O error, dev nvme0n1, sector 126894288 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
I think that is reason enough to mark the array member as failed, since
it now holds inconsistent data.
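
As an illustration only (the array name /dev/md0 and the error
threshold are assumptions for this sketch), one could script exactly
that: count the kernel's write I/O error messages per device and fail
the member out of the array once a threshold is crossed:

#!/usr/bin/env python3
# Sketch: fail an md member once it accumulates write I/O errors.
import re
import subprocess
from collections import Counter

# Matches lines like:
# "I/O error, dev nvme0n1, sector 538918912 op 0x1:(WRITE) ..."
IO_WRITE_ERR = re.compile(r"I/O error, dev (\w+), sector \d+ op 0x1")
THRESHOLD = 3  # arbitrary cut-off for this sketch

def devices_over_threshold():
    out = subprocess.check_output(["dmesg"], text=True)
    counts = Counter(m.group(1) for m in IO_WRITE_ERR.finditer(out))
    return [dev for dev, n in counts.items() if n >= THRESHOLD]

if __name__ == "__main__":
    for dev in devices_over_threshold():
        # Marking the member faulty makes md stop issuing I/O to it.
        subprocess.run(["mdadm", "/dev/md0", "--fail", f"/dev/{dev}"])

"mdadm <array> --fail <device>" is the standard way to mark a member
faulty by hand, the same state md would put it in on its own.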