On 22 Oct 2020, at 13:44, Jack Wang wrote:
Thomas Rosenstein <thomas.rosenstein@xxxxxxxxxxxxxxxx>
wrote on Thu, 22 Oct 2020 at 12:28 PM:
Hello,
I'm trying to do something interesting; the structure looks like this:
xfs
- mdraid
  - multipath (with no_path_retry = fail)
    - iscsi path 1
    - iscsi path 2
  - multipath (with no_path_retry = fail)
    - iscsi path 1
    - iscsi path 2
During normal operation everything looks good. Once a path fails
(i.e. the iSCSI target is removed), the array goes to degraded; if the
path comes back, nothing happens.
Q1) Can I enable auto recovery for failed devices?
If the device is re-added manually (or by software), everything resyncs
and works again, as it should.
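As an illustration of the "by software" variant, here is a minimal Python
sketch. It assumes that a degraded array shows up in /proc/mdstat with an
underscore in its status field (e.g. [2/1] [U_]) and that a faulty member
has to be removed before mdadm accepts a --re-add. The member path
/dev/mapper/mpathb is a hypothetical placeholder, and the helper names are
made up for this example.

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report a missing member.

    Looks for the "[n/m] [U_]" status field that /proc/mdstat prints
    for RAID1-style arrays; an underscore marks a missing member.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded.append(current)
            current = None
    return degraded


def readd_commands(array, member):
    """Build the mdadm calls to drop a faulty member and re-add it.

    Only returns the argument lists; running them (for example with
    subprocess.run) is left to the caller.
    """
    return [
        ["mdadm", array, "--remove", member],
        ["mdadm", array, "--re-add", member],
    ]


if __name__ == "__main__":
    with open("/proc/mdstat") as f:
        for name in degraded_arrays(f.read()):
            # /dev/mapper/mpathb is a placeholder for the returned device.
            for cmd in readd_commands("/dev/" + name, "/dev/mapper/mpathb"):
                print(" ".join(cmd))
```

The script only prints the commands it would run; deciding which member
actually came back is not covered here.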
If BOTH devices fail at the same time (worst-case scenario), it gets
wonky. I would expect a total hang (as with iSCSI and multipath
queue_if_no_path), but instead:
1) XFS reports Input/Output error
2) dmesg has logs like:
[Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41472, async page read
[Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41473, async page read
[Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41474, async page read
[Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41475, async page read
[Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41476, async page read
[Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41477, async page read
[Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41478, async page read
3) mdadm --detail /dev/md127 shows:
/dev/md127:
           Version : 1.2
     Creation Time : Wed Oct 21 17:25:22 2020
        Raid Level : raid1
        Array Size : 96640 (94.38 MiB 98.96 MB)
     Used Dev Size : 96640 (94.38 MiB 98.96 MB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent
       Update Time : Thu Oct 22 09:23:35 2020
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 1
     Spare Devices : 0
Consistency Policy : resync
              Name : v-b08c6663-7296-4c66-9faf-ac687
              UUID : cc282a5c:59a499b3:682f5e6f:36f9c490
            Events : 122
    Number   Major   Minor   RaidDevice State
       0     253        2        0      active sync   /dev/dm-2
       -       0        0        1      removed
       1     253        3        -      faulty   /dev/dm-
4) I can read from /dev/md127, but only as much as is already in the
buffer (see the dmesg logs above)
In my opinion, the following should happen instead, or it should at
least be configurable.
I expect:
1) XFS hangs indefinitely (like multipath queue_if_no_path)
2) mdadm shows FAULTED as State
Q2) Can this be configured in any way?
You can allow the last device to fail; see commit
9a567843f7ce ("md: allow last device to be forcibly removed from
RAID1/RAID10.")
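A minimal sketch of how this could be switched on from Python. It assumes
the commit exposes the setting as a per-array sysfs attribute named
fail_last_dev under /sys/block/<array>/md/ and that writing 1 enables it;
please verify this against your kernel before relying on it. The helper
names are made up for this example.

```python
from pathlib import Path


def fail_last_dev_path(array_name, sysfs_root="/sys"):
    """Return the assumed sysfs path of the fail_last_dev attribute."""
    return Path(sysfs_root) / "block" / array_name / "md" / "fail_last_dev"


def set_fail_last_dev(array_name, enabled, sysfs_root="/sys"):
    """Write 1 or 0 to the attribute and return the path written to.

    Needs root when pointed at the real /sys; sysfs_root exists so the
    helper can be tried against a scratch directory first.
    """
    path = fail_last_dev_path(array_name, sysfs_root)
    path.write_text("1" if enabled else "0")
    return path


if __name__ == "__main__":
    print(set_fail_last_dev("md127", True))
```

The setting is per array and, being a sysfs write, would presumably need
to be reapplied after the array is reassembled.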
That did work; the last device moved into faulted. Is there a way to
recover from that, or is the array completely broken at that point?
I tried to re-add the first device after it came back up, but that leads
to a recovery / synchronize loop.
By the way, this is on kernel 5.4.60.
After BOTH paths are recovered, nothing works anymore, and the raid
doesn't recover automatically.
Only a complete unmount and stop followed by an assemble and mount
makes the raid function again.
Q3) Is that expected behavior?
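For reference, the manual sequence described above (unmount, stop,
assemble, mount) could be scripted roughly as follows. This is only a
sketch: the mount point /mnt/data and the member paths
/dev/mapper/mpatha and /dev/mapper/mpathb are hypothetical placeholders,
and the helper names are made up for this example.

```python
import subprocess


def recovery_commands(array, members, mountpoint):
    """Build the unmount / stop / assemble / mount sequence as argv lists."""
    return [
        ["umount", mountpoint],
        ["mdadm", "--stop", array],
        ["mdadm", "--assemble", array, *members],
        ["mount", array, mountpoint],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True aborts the sequence at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder device names and mount point, not taken from the setup above.
    cmds = recovery_commands(
        "/dev/md127",
        ["/dev/mapper/mpatha", "/dev/mapper/mpathb"],
        "/mnt/data",
    )
    run_all(cmds, dry_run=True)
```

With dry_run left at True the script only prints the four commands, which
makes it easy to check the order before running anything as root.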
Thanks
Thomas Rosenstein