On 09.06.2022 1:48, Roger Heflin wrote:
> You might want to check whether specific disk devices are getting
> reset/rebooted; the more often they are reset/rebooted, the higher
> the chance of data loss. The vendor's solution in the case I know
> about was to treat unrequested device resets/reboots as a sign of a
> failing device, and to disable and replace it.
How can these resets/reboots be detected? Is there a counter in the
kernel or in the NVMe device itself?
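
For what it's worth, here is a minimal sketch of what such a check
could look like. It assumes nvme-cli is installed; the
"unsafe_shutdowns" and "media_errors" field names follow its JSON
output, and the exact wording of the driver's reset messages varies
by kernel version:

#!/usr/bin/env python3
# Sketch: look for evidence of NVMe controller resets.
import json
import subprocess

def smart_counters(dev="/dev/nvme0"):
    # nvme-cli can emit the SMART/health log as JSON.
    out = subprocess.check_output(
        ["nvme", "smart-log", dev, "--output-format=json"])
    log = json.loads(out)
    # unsafe_shutdowns increments when the controller goes down
    # without a clean shutdown notification; media_errors counts
    # unrecovered data-integrity errors.
    return log.get("unsafe_shutdowns"), log.get("media_errors")

def reset_messages():
    # The NVMe driver logs lines such as
    # "nvme nvme0: resetting controller" (wording varies by kernel).
    out = subprocess.check_output(["dmesg"], text=True)
    return [l for l in out.splitlines()
            if "nvme" in l and "reset" in l.lower()]

if __name__ == "__main__":
    unsafe, media = smart_counters()
    print(f"unsafe_shutdowns={unsafe} media_errors={media}")
    for line in reset_messages():
        print(line)

Note that the SMART counters live on the device itself and survive
reboots, while dmesg only covers the current boot.
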
> I don't know if this is what is causing your issue or not, but it is
> a possible issue, and one that is hard to write code to handle.
We see log messages explicitly reporting an I/O error and data not being
written:
[Tue Jun 7 09:58:45 2022] I/O error, dev nvme0n1, sector 538918912 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Tue Jun 7 09:58:45 2022] I/O error, dev nvme0n1, sector 538988816 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Tue Jun 7 09:58:48 2022] I/O error, dev nvme0n1, sector 126839568 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Tue Jun 7 09:58:48 2022] I/O error, dev nvme0n1, sector 126888224 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Tue Jun 7 09:58:48 2022] I/O error, dev nvme0n1, sector 126894288 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
I think that is reason enough to mark the array member as failed, since
it now holds inconsistent data.
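
As an illustration only (the array name /dev/md0 and the error
threshold are assumptions for this sketch), one could script exactly
that: count the kernel's write I/O error messages per device and fail
the member out of the array once a threshold is crossed:

#!/usr/bin/env python3
# Sketch: fail an md member once it accumulates write I/O errors.
import re
import subprocess
from collections import Counter

# Matches lines like:
# "I/O error, dev nvme0n1, sector 538918912 op 0x1:(WRITE) ..."
IO_WRITE_ERR = re.compile(r"I/O error, dev (\w+), sector \d+ op 0x1")
THRESHOLD = 3  # arbitrary cut-off for this sketch

def devices_over_threshold():
    out = subprocess.check_output(["dmesg"], text=True)
    counts = Counter(m.group(1) for m in IO_WRITE_ERR.finditer(out))
    return [dev for dev, n in counts.items() if n >= THRESHOLD]

if __name__ == "__main__":
    for dev in devices_over_threshold():
        # Marking the member faulty makes md stop issuing I/O to it.
        subprocess.run(["mdadm", "/dev/md0", "--fail", f"/dev/{dev}"])

"mdadm <array> --fail <device>" is the standard way to mark a member
faulty by hand, the same state md would put it in on its own.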