This scenario happened while I was testing a potential migration in a
VM, so thankfully no data is actually at risk. However, the testing
points towards a potential bug when multiple devices are failed in a
raid10.
This is on Ubuntu 24.04 with a hand-compiled vanilla kernel, version
6.10.6. It was also tried on Ubuntu 22.04 (some version of 5.15) and
stock Ubuntu 24.04 (some version of 6.8).
The dmesg output: https://nuitari.net/dmesg
The config: https://nuitari.net/config-kernel
The setup is 10 drives, 2 of which are 4.1x bigger than the rest. The
partition used in the raid is #3.
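For completeness, only partition #3 of each disk goes into the raid;
that partition can be recreated roughly like this (the 16G size is
illustrative, not the exact one used):
# for d in /dev/vd[a-j]; do sgdisk -n 3:0:+16G $d; done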
# mdadm --create /dev/md0 -l 10 -n 10 /dev/vd?3
# cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 vdj3[9] vdi3[8] vdh3[7] vdg3[6] vdf3[5] vde3[4]
vdd3[3] vdc3[2] vdb3[1] vda3[0]
78597120 blocks super 1.2 512K chunks 2 near-copies [10/10]
[UUUUUUUUUU]
# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Aug 20 16:56:34 2024
Raid Level : raid10
Array Size : 78597120 (74.96 GiB 80.48 GB)
Used Dev Size : 15719424 (14.99 GiB 16.10 GB)
Raid Devices : 10
Total Devices : 10
Persistence : Superblock is persistent
Update Time : Tue Aug 20 16:57:39 2024
State : clean
Active Devices : 10
Working Devices : 10
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Consistency Policy : resync
Name : 0
UUID : 8f69074e:137439a1:c307426c:4cd10069
Events : 17
Number Major Minor RaidDevice State
0 253 3 0 active sync set-A /dev/vda3
1 253 19 1 active sync set-B /dev/vdb3
2 253 35 2 active sync set-A /dev/vdc3
3 253 51 3 active sync set-B /dev/vdd3
4 253 67 4 active sync set-A /dev/vde3
5 253 83 5 active sync set-B /dev/vdf3
6 253 99 6 active sync set-A /dev/vdg3
7 253 115 7 active sync set-B /dev/vdh3
8 253 131 8 active sync set-A /dev/vdi3
9 253 147 9 active sync set-B /dev/vdj3
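The array was then formatted and mounted before writing the test data,
roughly like this (the mount point is an assumption on my part):
# mkfs.ext4 /dev/md0
# mount /dev/md0 /mnt
# cd /mnt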
# for i in 1 2 3; do dd if=/dev/urandom of=garbage$i bs=1G count=20; done
# sha1sum gar* > sums
# cat sums
034e10a97244ef762c0ce0389057829df1086d1e garbage1
28359e08488bbd94ce434f37fdeb71a6ddceabcb garbage2
afca3bc256deb149cb46adb85c22fb4716fe7656 garbage3
At this point I have a clean, ext4-formatted array at /dev/md0 with
some randomly generated test data.
For the test, I fail all of the set-A drives, which in theory should
still allow a near=2 layout to continue operating:
# sync
# mdadm --fail /dev/md0 /dev/vd[acegi]3
mdadm: set /dev/vda3 faulty in /dev/md0
mdadm: set /dev/vdc3 faulty in /dev/md0
mdadm: set /dev/vde3 faulty in /dev/md0
mdadm: set /dev/vdg3 faulty in /dev/md0
mdadm: set /dev/vdi3 faulty in /dev/md0
# cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 vdj3[9] vdi3[8](F) vdh3[7] vdg3[6](F) vdf3[5]
vde3[4](F) vdd3[3] vdc3[2](F) vdb3[1] vda3[0](F)
78597120 blocks super 1.2 512K chunks 2 near-copies [10/5]
[_U_U_U_U_U]
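That [_U_U_U_U_U] pattern matches my understanding of the near layout:
each 512K chunk is written to two adjacent devices, so set-A and set-B
each hold a complete copy of the data, roughly:
  chunk 0 -> vda3 (set-A) + vdb3 (set-B)
  chunk 1 -> vdc3 (set-A) + vdd3 (set-B)
  ...
  chunk 4 -> vdi3 (set-A) + vdj3 (set-B)
  chunk 5 -> vda3 + vdb3 again, and so on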
# sysctl -w vm.drop_caches=3
# sha1sum garbage*
5ed07afa38dae9686ff7e4301a9c48da5215cc5c garbage1
28359e08488bbd94ce434f37fdeb71a6ddceabcb garbage2
afca3bc256deb149cb46adb85c22fb4716fe7656 garbage3
For some reason garbage1 reads back with a different checksum, and only
this one time: rerunning the sha1sum immediately afterwards produces
the expected checksum.
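To gauge how often such a bad read occurs, I can keep re-checking with
cold caches, something along these lines (the loop stops on the first
mismatch):
# while sha1sum -c sums; do sysctl -w vm.drop_caches=3; done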
I can re-add the failed drives no problem. But sometimes, if I fail
things more brutally:
# for i in /dev/vd?3; do mdadm --fail /dev/md0 $i; done
I might get ext4 complaining, even though the raid is still active.
In that case, a seemingly random selection of drives ends up in the
failed state:
Number Major Minor RaidDevice State
14 253 131 0 active sync set-A /dev/vdi3
1 253 19 1 faulty /dev/vdb3
13 253 99 2 active sync set-A /dev/vdg3
3 253 51 3 faulty /dev/vdd3
12 253 67 4 faulty /dev/vde3
5 253 83 5 active sync set-B /dev/vdf3
11 253 35 6 faulty /dev/vdc3
7 253 115 7 active sync set-B /dev/vdh3
10 253 3 8 faulty /dev/vda3
9 253 147 9 active sync set-B /dev/vdj3
# cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 vdi3[14] vdg3[13] vde3[12](F) vdc3[11](F) vda3[10](F)
vdj3[9] vdh3[7] vdf3[5] vdd3[3](F) vdb3[1](F)
78597120 blocks super 1.2 512K chunks 2 near-copies [10/5]
[U_U__U_U_U]
Then I tried to unmount /dev/md0 and got the following dmesg entries:
[19056.984999] Buffer I/O error on dev md0, logical block 9469952, lost
sync page write
[19056.985009] JBD2: I/O error when updating journal superblock for md0-8.
[19056.985014] Aborting journal on device md0-8.
[19056.985019] Buffer I/O error on dev md0, logical block 9469952, lost
sync page write
[19056.985025] JBD2: I/O error when updating journal superblock for md0-8.
[19056.985037] EXT4-fs error (device md0): ext4_put_super:1310: comm
umount: Couldn't clean up the journal
[19056.985048] Buffer I/O error on dev md0, logical block 0, lost sync
page write
[19056.985054] EXT4-fs (md0): I/O error while writing superblock
[19056.985059] EXT4-fs (md0): Remounting filesystem read-only
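For reference, re-adding was done along these lines (a faulty member
has to be removed before it can be added back; mdadm just errors out
for the members that are still active, and the exact set of failed
devices varied between runs):
# for i in /dev/vd?3; do mdadm /dev/md0 --remove $i; mdadm /dev/md0 --add $i; done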
Even after re-adding the failed devices, mounting the filesystem fails:
[21146.796366] md: recovery of RAID array md0
[21215.559109] md: md0: recovery done.
[21225.310156] Buffer I/O error on dev md0, logical block 0, lost sync
page write
[21225.310168] EXT4-fs (md0): I/O error while writing superblock
[21225.310176] EXT4-fs (md0): mount failed
[21225.310230] Aborting journal on device md0-8.
[21225.310238] Buffer I/O error on dev md0, logical block 9469952, lost
sync page write
[21225.310244] JBD2: I/O error when updating journal superblock for md0-8.
But stopping /dev/md0 and re-assembling it works:
# mdadm --stop /dev/md0
# mdadm --assemble /dev/md0 /dev/vd?3
mdadm: /dev/md0 has been started with 10 drives.
# cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 vdi3[14] vdj3[9] vda3[10] vdh3[7] vdb3[11] vdf3[5]
vdc3[12] vdd3[15] vdg3[13] vde3[16]
78597120 blocks super 1.2 512K chunks 2 near-copies [10/10]
[UUUUUUUUUU]
Re-checksumming the test data yields the correct results.
The mdadm version used here is 4.3.
If this isn't the right venue for this, please point me towards the
correct place.
As this is a test environment, I can perform dangerous tests.