> On Dec 8, 2020, at 8:17 PM, Song Liu <songliubraving@xxxxxx> wrote:
> 
> Hi Matthew,
> 
>> On Dec 8, 2020, at 7:46 PM, Matthew Ruffell <matthew.ruffell@xxxxxxxxxxxxx> wrote:
>> 
>> Hello,
>> 
>> I recently backported the following patches into the Ubuntu stable kernels:
>> 
>> md: add md_submit_discard_bio() for submitting discard bio
>> md/raid10: extend r10bio devs to raid disks
>> md/raid10: pull codes that wait for blocked dev into one function
>> md/raid10: improve raid10 discard request
>> md/raid10: improve discard request for far layout

I reproduced the issue with 5.10-rc7. With md/raid10, the issue is fixed
when I revert the md/raid10 patches.

>> dm raid: fix discard limits for raid1 and raid10
>> dm raid: remove unnecessary discard limits for raid10

Since 5.10 will be officially released this weekend, I am afraid we have to
revert these changes for 5.10.

I just sent a patch to revert f0e90b6c663a ("dm raid: remove unnecessary
discard limits for raid10").

I will send a pull request to revert the md/raid10 patches.

Thanks,
Song

> 
> Thanks for the report!
> 
> Hi Xiao,
> 
> Could you please take a look at this and let me know soon? We need to fix
> this before the 5.10 official release.
> 
> Thanks,
> Song
> 
>> 
>> and this morning, a user reported the following downstream bug:
>> 
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
>> 
>> Their weekly cron job that runs fstrim had run, and their raid10 array has
>> extensive data corruption.
>> 
>> The issue is reproducible on the latest 5.10-rc7 mainline kernel; steps are
>> below.
>> 
>> I used an m5d.4xlarge instance on AWS to utilise 2x 300GB SSDs that support
>> block discard. You will want to select small disks to lower the time needed
>> to reproduce.
>> 
>> $ uname -rv
>> 5.10.0-rc7+ #1 SMP Wed Dec 9 01:15:27 UTC 2020
>> 
>> Create a raid10 array, with LVM:
>> 
>> $ lsblk
>> NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>> nvme0n1     259:0    0     8G  0 disk
>> └─nvme0n1p1 259:1    0     8G  0 part /
>> nvme1n1     259:2    0 279.4G  0 disk
>> nvme2n1     259:3    0 279.4G  0 disk
>> 
>> $ sudo -s
>> # mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme1n1 /dev/nvme2n1
>> mdadm: layout defaults to n2
>> mdadm: layout defaults to n2
>> mdadm: chunk size defaults to 512K
>> mdadm: size set to 292836352K
>> mdadm: automatically enabling write-intent bitmap on large array
>> mdadm: Defaulting to version 1.2 metadata
>> mdadm: array /dev/md0 started.
>> # pvcreate -ff -y /dev/md0
>> Physical volume "/dev/md0" successfully created.
>> # vgcreate -f -y VolGroup /dev/md0
>> Volume group "VolGroup" successfully created
>> # lvcreate -n root -L 100G -ay -y VolGroup
>> Logical volume "root" created.
>> # mkfs.ext4 /dev/VolGroup/root
>> mke2fs 1.44.1 (24-Mar-2018)
>> Discarding device blocks: done
>> Creating filesystem with 26214400 4k blocks and 6553600 inodes
>> Filesystem UUID: d7be2e14-fa4d-4489-884b-3bef63b1e1db
>> Superblock backups stored on blocks:
>> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>> 4096000, 7962624, 11239424, 20480000, 23887872
>> 
>> Allocating group tables: done
>> Writing inode tables: done
>> Creating journal (131072 blocks): done
>> Writing superblocks and filesystem accounting information: done
>> # mount /dev/VolGroup/root /mnt
>> 
>> Next, wait for the disk check to complete, which takes about 25 minutes on
>> an m5d.4xlarge instance.
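For anyone scripting this reproducer, the wait for the initial resync can be
automated; a minimal sketch in shell, assuming the array is /dev/md0 and that
this mdadm build supports --wait:

  # Block until any resync/recovery on /dev/md0 finishes
  mdadm --wait /dev/md0
  # or, equivalently, poll /proc/mdstat until no resync is reported
  while grep -q resync /proc/mdstat; do sleep 60; done

Either form returns once /proc/mdstat no longer reports a resync, matching
the idle state shown below.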
>> 
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme2n1[1] nvme1n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       [==>..................]  resync = 12.0% (35211392/292836352) finish=21.4min speed=200340K/sec
>>       bitmap: 3/3 pages [12KB], 65536KB chunk
>> 
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 76918016
>> 
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme2n1[1] nvme1n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>> 
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 582330240
>> 
>> Now that the check is complete, create a file, sync and delete it:
>> 
>> # dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.95974 s, 1.1 GB/s
>> # sync
>> # rm /mnt/data.raw
>> 
>> Perform a check:
>> 
>> # echo check > /sys/block/md0/md/sync_action
>> 
>> Again, wait 25 minutes for it to complete:
>> 
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       [==>..................]  check = 13.7% (40356224/292836352) finish=20.8min speed=201707K/sec
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>> 
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 1469696
>> 
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>> 
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 1469696
>> 
>> Now, perform the fstrim:
>> 
>> # fstrim /mnt --verbose
>> /mnt: 97.9 GiB (105089236992 bytes) trimmed
>> 
>> Go for another check:
>> 
>> # echo check >/sys/block/md0/md/sync_action
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       [========>............]  check = 40.3% (118270848/292836352) finish=14.4min speed=200963K/sec
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>> 
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 205324928
>> 
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>> 
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 205324928
>> 
>> Now we need to take the raid10 array down and perform an fsck on one disk
>> at a time:
>> 
>> # umount /mnt
>> # vgchange -a n /dev/VolGroup
>> 0 logical volume(s) in volume group "VolGroup" now active
>> # mdadm --stop /dev/md0
>> mdadm: stopped /dev/md0
>> 
>> Let's do the first disk:
>> 
>> # mdadm --assemble /dev/md127 /dev/nvme1n1
>> mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
>> # mdadm --run /dev/md127
>> mdadm: started array /dev/md/lv-raid
>> # vgchange -a y /dev/VolGroup
>> 1 logical volume(s) in volume group "VolGroup" now active
>> # fsck.ext4 -n -f /dev/VolGroup/root
>> e2fsck 1.44.1 (24-Mar-2018)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>> /dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks
>> # vgchange -a n /dev/VolGroup
>> 0 logical volume(s) in volume group "VolGroup" now active
>> # mdadm --stop /dev/md127
>> mdadm: stopped /dev/md127
>> 
>> The second disk:
>> 
>> # mdadm --assemble /dev/md127 /dev/nvme2n1
>> mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
>> # mdadm --run /dev/md127
>> mdadm: started array /dev/md/lv-raid
>> # vgchange -a y /dev/VolGroup
>> 1 logical volume(s) in volume group "VolGroup" now active
>> # fsck.ext4 -n -f /dev/VolGroup/root
>> e2fsck 1.44.1 (24-Mar-2018)
>> Resize inode not valid. Recreate? no
>> 
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 7 has illegal block(s). Clear? no
>> 
>> Illegal indirect block (1714656753) in inode 7. IGNORED.
>> Error while iterating over blocks in inode 7: Illegal indirect block found
>> 
>> /dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********
>> 
>> e2fsck: aborted
>> 
>> /dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********
>> 
>> # vgchange -a n /dev/VolGroup
>> 0 logical volume(s) in volume group "VolGroup" now active
>> # mdadm --stop /dev/md127
>> mdadm: stopped /dev/md127
>> 
>> There are no panics or anything in dmesg. The directory structure of the
>> first disk is intact, but the second disk only has lost+found present.
>> 
>> I can confirm that the cause is the patches listed at the top of the email,
>> but I have not had an opportunity to bisect to find the exact root cause. I
>> will do that once we confirm which Ubuntu stable kernels are affected and
>> begin reverting the patches.
>> 
>> Let me know if you need any more details.
>> 
>> Thanks,
>> Matthew Ruffell
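For reference, the bisection Matthew mentions could be scripted once the
affected Ubuntu stable kernels are confirmed; a minimal sketch, where GOOD
and BAD are placeholder refs (not taken from this report) for a kernel
without and with the five md/raid10 discard patches:

  # GOOD/BAD are hypothetical refs bracketing the backported series
  git bisect start
  git bisect bad  "$BAD"       # kernel that shows the corruption
  git bisect good "$GOOD"      # kernel before the series was applied
  # At each step: build and boot the kernel, run the mkfs/fstrim/check
  # reproducer above, then mark the result:
  #   git bisect good    # mismatch_cnt stays stable and fsck -n is clean
  #   git bisect bad     # mismatch_cnt jumps after fstrim or fsck reports errors
  git bisect reset             # restore the tree when finished

Each round narrows the range to the individual patch that introduces the
corruption.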