On Thu, Oct 22, 2020 at 12:28 PM, Thomas Rosenstein <thomas.rosenstein@xxxxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> I'm trying to do something interesting. The structure looks like this:
>
> xfs
>   - mdraid
>     - multipath (with no_path_queue = fail)
>       - iscsi path 1
>       - iscsi path 2
>     - multipath (with no_path_queue = fail)
>       - iscsi path 1
>       - iscsi path 2
>
> During normal operation everything looks good. Once a path fails (i.e.
> the iscsi target is removed), the array goes degraded; if the path
> comes back, nothing happens.
>
> Q1) Can I enable auto recovery for failed devices?
>
> If the device is re-added manually (or by software), everything
> resyncs and it works again, as it all should.
>
> If BOTH devices fail at the same time (worst-case scenario), it gets
> wonky. I would expect a total hang (as with iscsi and multipath
> queue_no_path). Instead:
>
> 1) XFS reports Input/Output error
>
> 2) dmesg has logs like:
>
> [Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41472, async page read
> [Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41473, async page read
> [Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41474, async page read
> [Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41475, async page read
> [Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41476, async page read
> [Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41477, async page read
> [Thu Oct 22 09:25:28 2020] Buffer I/O error on dev md127, logical block 41478, async page read
>
> 3) mdadm --detail /dev/md127 shows:
>
> /dev/md127:
>            Version : 1.2
>      Creation Time : Wed Oct 21 17:25:22 2020
>         Raid Level : raid1
>         Array Size : 96640 (94.38 MiB 98.96 MB)
>      Used Dev Size : 96640 (94.38 MiB 98.96 MB)
>       Raid Devices : 2
>      Total Devices : 2
>        Persistence : Superblock is persistent
>
>        Update Time : Thu Oct 22 09:23:35 2020
>              State : clean, degraded
>     Active Devices : 1
>    Working Devices : 1
>     Failed Devices : 1
>      Spare Devices : 0
>
> Consistency Policy : resync
>
>               Name : v-b08c6663-7296-4c66-9faf-ac687
>               UUID : cc282a5c:59a499b3:682f5e6f:36f9c490
>             Events : 122
>
>     Number   Major   Minor   RaidDevice State
>        0      253       2        0      active sync   /dev/dm-2
>        -        0       0        1      removed
>
>        1      253       3        -      faulty   /dev/dm-
>
> 4) I can read from /dev/md127, but only whatever is already in the
> page cache (see the dmesg logs above).
>
> In my opinion this is what should happen, or it should at least be
> configurable. I expect:
>
> 1) XFS hangs indefinitely (like multipath queue_no_path)
> 2) mdadm shows FAULTED as the State
>
> Q2) Can this be configured in any way?

You can allow the last device to fail: see commit 9a567843f7ce ("md:
allow last device to be forcibly removed from RAID1/RAID10."). There is
a sketch of how to use it at the end of this mail.

> After BOTH paths are recovered, nothing works anymore, and the raid
> doesn't recover automatically. Only a complete unmount and stop,
> followed by an assemble and mount, makes the raid function again.
>
> Q3) Is that expected behavior?
>
> Thanks
> Thomas Rosenstein
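
On Q1: md itself never pulls a failed member back in, but mdadm can do
it for you when the device reappears, via udev incremental assembly
plus a POLICY line. A minimal sketch, assuming your distro ships the
usual udev rule that runs "mdadm --incremental" on block-device add
events:

  # /etc/mdadm.conf (may be /etc/mdadm/mdadm.conf on Debian-based systems)
  # When a device that was recently an array member shows up again,
  # re-add it to the slot it last occupied.
  POLICY domain=default path=* action=re-add

A write-intent bitmap makes the subsequent resync cheap, since only
blocks written while the path was down get copied:

  mdadm --grow --bitmap=internal /dev/md127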
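
On Q2, to expand on the commit reference above: if I remember right,
that change added a fail_last_dev sysfs attribute (v5.4 or so; check
your kernel). With it set, md will mark even the last working member
Faulty instead of limping along and returning errors. A sketch,
assuming the attribute sits in the usual per-array sysfs directory:

  # Allow md to fail the last remaining member of md127.
  echo 1 > /sys/block/md127/md/fail_last_dev

That gives you the second half of your expectation (the array shows up
as failed); it does not make XFS hang the way queue_if_no_path does,
since as far as I know md has no equivalent of multipath's
queue-if-no-path mode.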
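
On Q3: a full unmount/stop/assemble cycle should not be needed once the
paths are back; manually re-adding the failed member is normally
enough, at least while the array is still degraded rather than fully
failed. Roughly, with /dev/dm-X standing in for whichever multipath
device went faulty in your --detail output:

  # Clear the faulty slot, then put the member back; with a bitmap
  # this is a quick catch-up rather than a full resync.
  mdadm /dev/md127 --remove /dev/dm-X
  mdadm /dev/md127 --re-add /dev/dm-X

The double-failure case is uglier: once md has given up on the array
entirely, stop-and-reassemble is, as far as I can tell, the only way
back, which matches what you are seeing.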