Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

On Thu, Jan 25, 2024 at 12:31 PM Dan Moulding <dan@xxxxxxxx> wrote:
>
> Hi Junxiao,
>
> I first noticed this problem the day after I had upgraded some
> machines to the 6.7.1 kernel. One of the machines is a backup server.
> Just a few hours after the upgrade to 6.7.1, it started running its
> overnight backup jobs. Those backup jobs hung part way through. When I
> tried to check on the backups in the morning, I found the server
> mostly unresponsive. I could SSH in, but most shell commands would just
> hang. I was able to run top and see that the md0_raid5 kernel thread
> was using 100% CPU. I tried to reboot the server, but it wasn't able
> to shut down successfully and eventually I had to hard reset it.
>
> The next day, the same sequence of events occurred on that server
> again when it tried to run its backup jobs. Then the following day, I
> experienced another hang on a different machine with a similar RAID-5
> configuration. That time I was scp'ing a large file to a virtual
> machine whose image was stored on the RAID-5 array. Part way through
> the transfer, scp reported that it had stalled. I checked top on that
> machine and found once again that the md0_raid5 kernel thread was
> using 100% CPU.
>
> Yesterday I created a fresh Fedora 39 VM for the purposes of
> reproducing this problem in a different environment (the other two
> machines are both Gentoo servers running v6.7 kernels straight from
> the stable trees with a custom kernel configuration). I am able to
> reproduce the problem on Fedora 39 running both the v6.6.13 stable
> tree kernel code and the Fedora 39 6.6.13 distribution kernel.
>
> On this Fedora 39 VM, I created a 1 GiB LVM volume from space on the
> "boot" disk to use as the RAID-5 journal. Then I attached 3 additional
> 100 GiB virtual disks and created the RAID-5 from those 3 disks and
> the write-journal device. I then created a new LVM volume group from
> the md0 array and created one LVM logical volume named "data", using
> all but 64 GiB of the available VG space. I then created an ext4 file
> system on the "data" volume, mounted it, and used "dd" to copy 1 MiB
> blocks from /dev/urandom to a file on the "data" file system, and just
> let it run. Eventually "dd" hangs and top shows that md0_raid5 is
> using 100% CPU.
>
> Here is an example command I just ran, which has hung after writing
> 4.1 GiB of random data to the array:
>
> test@localhost:~$ dd if=/dev/urandom bs=1M of=/data/random.dat status=progress
> 4410310656 bytes (4.4 GB, 4.1 GiB) copied, 324 s, 13.6 MB/s
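
For reference, the reproduction steps Dan describes translate roughly to
the commands below. The device paths and VG/LV names are placeholders,
not necessarily the exact ones he used:

# 1 GiB journal LV carved out of free space on the "boot" disk's VG
lvcreate -L 1G -n raid5-journal vg_boot

# RAID-5 across the three 100 GiB virtual disks plus the write-journal
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      --write-journal /dev/vg_boot/raid5-journal \
      /dev/vdb /dev/vdc /dev/vdd

# LVM on top of the array; "data" leaves 64 GiB of the ~200 GiB VG free
pvcreate /dev/md0
vgcreate vg_md /dev/md0
lvcreate -L 136G -n data vg_md

# ext4 on "data", then write until the hang reproduces
mkfs.ext4 /dev/vg_md/data
mkdir -p /data && mount /dev/vg_md/data /data
dd if=/dev/urandom bs=1M of=/data/random.dat status=progress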

Update on this:

I have been testing the following config on the md-6.9 branch [1].
The array works fine as far as I can tell.

Dan, could you please run the test on this branch
(83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8)?

Thanks,
Song


[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-6.9
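
In case it helps, checking out and building that commit should look
roughly like the following. This assumes reusing the running kernel's
configuration; adjust the build and install steps for your distro:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git
cd md
git checkout 83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8
cp "/boot/config-$(uname -r)" .config   # start from the running config
make olddefconfig
make -j"$(nproc)"
sudo make modules_install install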

[root@eth50-1 ~]# lsblk
NAME                             MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0                               11:0    1 1024M  0 rom
vda                              253:0    0   32G  0 disk
├─vda1                           253:1    0    2G  0 part  /boot
└─vda2                           253:2    0   30G  0 part  /
nvme2n1                          259:0    0   50G  0 disk
└─md0                              9:0    0  100G  0 raid5
  ├─vg--md--data-md--data-real   250:2    0   50G  0 lvm
  │ ├─vg--md--data-md--data      250:1    0   50G  0 lvm   /mnt/2
  │ └─vg--md--data-snap          250:4    0   50G  0 lvm
  └─vg--md--data-snap-cow        250:3    0   49G  0 lvm
    └─vg--md--data-snap          250:4    0   50G  0 lvm
nvme0n1                          259:1    0   50G  0 disk
└─md0                              9:0    0  100G  0 raid5
  ├─vg--md--data-md--data-real   250:2    0   50G  0 lvm
  │ ├─vg--md--data-md--data      250:1    0   50G  0 lvm   /mnt/2
  │ └─vg--md--data-snap          250:4    0   50G  0 lvm
  └─vg--md--data-snap-cow        250:3    0   49G  0 lvm
    └─vg--md--data-snap          250:4    0   50G  0 lvm
nvme1n1                          259:2    0   50G  0 disk
└─md0                              9:0    0  100G  0 raid5
  ├─vg--md--data-md--data-real   250:2    0   50G  0 lvm
  │ ├─vg--md--data-md--data      250:1    0   50G  0 lvm   /mnt/2
  │ └─vg--md--data-snap          250:4    0   50G  0 lvm
  └─vg--md--data-snap-cow        250:3    0   49G  0 lvm
    └─vg--md--data-snap          250:4    0   50G  0 lvm
nvme4n1                          259:3    0    2G  0 disk
nvme3n1                          259:4    0   50G  0 disk
└─vg--data-lv--journal           250:0    0  512M  0 lvm
  └─md0                            9:0    0  100G  0 raid5
    ├─vg--md--data-md--data-real 250:2    0   50G  0 lvm
    │ ├─vg--md--data-md--data    250:1    0   50G  0 lvm   /mnt/2
    │ └─vg--md--data-snap        250:4    0   50G  0 lvm
    └─vg--md--data-snap-cow      250:3    0   49G  0 lvm
      └─vg--md--data-snap        250:4    0   50G  0 lvm
nvme5n1                          259:5    0    2G  0 disk
nvme6n1                          259:6    0    4G  0 disk
[root@eth50-1 ~]# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 nvme2n1[4] dm-0[3](J) nvme1n1[1] nvme0n1[0]
      104790016 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
[root@eth50-1 ~]# mount | grep /mnt/2
/dev/mapper/vg--md--data-md--data on /mnt/2 type ext4 (rw,relatime,stripe=256)
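
For completeness, the test setup above was created along these lines.
This is a sketch reconstructed from the lsblk output, not the exact
command history:

# 512 MiB journal LV on nvme3n1
pvcreate /dev/nvme3n1
vgcreate vg-data /dev/nvme3n1
lvcreate -L 512M -n lv-journal vg-data

# RAID-5 across the three 50 GiB NVMe disks plus the journal
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      --write-journal /dev/vg-data/lv-journal \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

# LVM and ext4 on top, plus the snapshot visible in lsblk
pvcreate /dev/md0
vgcreate vg-md-data /dev/md0
lvcreate -L 50G -n md-data vg-md-data
mkfs.ext4 /dev/vg-md-data/md-data
mount /dev/vg-md-data/md-data /mnt/2
lvcreate -s -L 49G -n snap /dev/vg-md-data/md-data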




