Re: BUG: RAID6 recovery broken by commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef (Linux 5.1.3)

Song Liu <liu.song.a23@xxxxxxxxx> · Fri, 24 May 2019 10:53:03 -0700

On Fri, May 24, 2019 at 1:11 AM Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx> wrote:
>
> On 5/22/19 8:24 PM, Song Liu wrote:
> > Hi Thorsten,
> >
> > On Wed, May 22, 2019 at 9:19 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
> >>
> >> Hi Thorsten,
> >>
> >> Thanks for the report. I will follow up with stable@ to fix them.
> >>
> >> Best regards,
> >> Song
> >
> > Could you please confirm that the following patches fix the issue?
> >
> > commit a25d8c327bb4 ("Revert "Don't jump to compute_result state from
> > check_result state"")
> > commit b2176a1dfb51 ("md/raid: raid5 preserve the writeback action
> > after the parity check")
>
> Hello Song.
>
> With the two patches applied to Linux 5.1.4 I was not able to reproduce
> the previously observed file system and data corruptions by replacing a
> disk of a RAID6 array.
>
> Thorsten

Thanks for testing the fix!

Song

>
> >
> > Thanks,
> > Song
> >
> >
> >>
> >> On Wed, May 22, 2019 at 5:26 AM Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx> wrote:
> >>>
> >>> Hello.
> >>>
> >>> BUG: RAID6 recovery broken by commit
> >>> 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef (Linux 5.1.3+)
> >>>
> >>> Replacing a failed disk of a MD RAID6 array causes file system
> >>> corruption and data loss on kernels containing commit
> >>> 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef.
> >>>
> >>> Affected kernels: 5.1.3, 5.1.4, possibly others.
> >>> Unaffected kernels: 5.1.2
> >>>
> >>> OS: Debian stretch amd64
> >>>
> >>> Steps to reproduce the BUG:
> >>>
> >>> 1. Create a new 4-disk RAID6 array, create a file system and mount it:
> >>>    mdadm /dev/md0 --create -l 6 -n 4 /dev/sd[bcde]
> >>>    mkfs.ext4 /dev/md0
> >>>    mount /dev/md0 /mnt
> >>> 2. Store some data (a few GB should be fine) on the RAID6 array's file
> >>> system:
> >>>    cp -r whatever /mnt
> >>> 3. Fail a disk of the RAID6 array and remove it from the array:
> >>>    mdadm /dev/md0 --fail /dev/sdd
> >>>    mdadm /dev/md0 --remove /dev/sdd
> >>> 4. Drop caches:
> >>>    echo "3" > /proc/sys/vm/drop_caches
> >>> 5. Compare data copied to the RAID6 array in step 2 with its source:
> >>>    diff -r whatever /mnt/whatever
> >>>    There should be no differences and no file system errors.
> >>> 6. Add a new empty disk to the RAID6 array:
> >>>    mdadm /dev/md0 --add /dev/sdf
> >>> 7. RAID6 recovery should start now, wait for the RAID6 recovery to finish.
> >>> 8. Drop caches again:
> >>>    echo "3" > /proc/sys/vm/drop_caches
> >>> 9. Compare data copied to the RAID6 array in step 2 with its source again:
> >>>    diff -r whatever /mnt/whatever
> >>>    diff now reports a lot of differences and the kernel log gets filled
> >>> with file system errors. For example:
> >>>    EXT4-fs warning (device md0): ext4_dirent_csum_verify:355: inode
> >>> #918549: comm diff: No space for directory leaf checksum. Please run
> >>> e2fsck -D.
> >>>
> >>> Reverting commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef from kernel
> >>> 5.1.4 resolves the issues described above.
> >>>
> >>> Kind regards
> >>> Thorsten
> >>>
> >>>
> >>> --
> >>> ___
> >>>  |        | /                 E-Mail: linux@xxxxxxxxxxxxxxxxx
> >>>  |horsten |/\nabe                WWW: http://linux.thorsten-knabe.de
> >>>
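
For anyone re-running the reproduction steps from the original report, the
sequence could be collected into one script along the following lines. This
is only a sketch: the device names and the "whatever" source tree are taken
from the report, while SRC, RUN and the run/drop_caches/reproduce helpers are
hypothetical names introduced here. Using "mdadm --wait" for step 7 is an
assumption, since the report only says to wait for recovery to finish. By
default the script just prints the commands; with RUN=1 it executes them as
root and destroys any data on the listed disks.

```shell
#!/bin/sh
# Sketch of the reproduction steps from the report (not part of the report).
# DESTRUCTIVE when RUN=1: the member disks are overwritten.
set -e

SRC=${SRC:-whatever}   # source tree copied in step 2 (name from the report)
RUN=${RUN:-0}          # 0 = print commands only, 1 = execute them (needs root)

# Print a command and, if RUN=1, execute it.
run() {
    echo "+ $*"
    if [ "$RUN" = 1 ]; then "$@"; fi
}

# Steps 4 and 8: drop the page cache so diff reads from the array.
drop_caches() {
    echo "+ echo 3 > /proc/sys/vm/drop_caches"
    if [ "$RUN" = 1 ]; then echo 3 > /proc/sys/vm/drop_caches; fi
}

reproduce() {
    # Step 1: create the 4-disk RAID6 array, make a file system, mount it.
    run mdadm /dev/md0 --create -l 6 -n 4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    run mkfs.ext4 /dev/md0
    run mount /dev/md0 /mnt
    # Step 2: store some data on the array.
    run cp -r "$SRC" /mnt
    # Step 3: fail one disk and remove it from the array.
    run mdadm /dev/md0 --fail /dev/sdd
    run mdadm /dev/md0 --remove /dev/sdd
    # Steps 4-5: the degraded array should still return correct data.
    drop_caches
    run diff -r "$SRC" "/mnt/$SRC" || echo "differences before recovery"
    # Step 6: add a new empty disk; recovery should start.
    run mdadm /dev/md0 --add /dev/sdf
    # Step 7 (assumption): block until recovery ends. --wait can return early
    # if recovery has not started yet, so check /proc/mdstat as well.
    run mdadm --wait /dev/md0 || true
    # Steps 8-9: on affected kernels this diff reports differences.
    drop_caches
    run diff -r "$SRC" "/mnt/$SRC" || echo "differences after recovery"
}

reproduce
```

Running it without RUN=1 first shows the exact command sequence, which makes
it easy to adjust the device names before touching any disks.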