Re: BUG: RAID6 recovery broken by commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef (Linux 5.1.3)

Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx> · Fri, 24 May 2019 10:10:30 +0200

On 5/22/19 8:24 PM, Song Liu wrote:
> Hi Thorsten,
> 
> On Wed, May 22, 2019 at 9:19 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
>>
>> Hi Thorsten,
>>
>> Thanks for the report. I will follow up with stable@ to fix them.
>>
>> Best regards,
>> Song
> 
> Could you please confirm that the following patches fix the issue?
> 
> commit a25d8c327bb4 ("Revert "Don't jump to compute_result state from
> check_result state"")
> commit b2176a1dfb51 ("md/raid: raid5 preserve the writeback action
> after the parity check")

Hello Song.

With the two patches applied to Linux 5.1.4, I was no longer able to
reproduce the file system and data corruption previously observed when
replacing a disk of a RAID6 array.
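For anyone re-running the reproduction steps quoted below: instead of comparing against the source tree with diff -r, steps 5 and 9 can also check against a checksum manifest recorded right after step 2. This still works if the source tree is no longer around. Below is a minimal sketch, assuming GNU coreutils and findutils as shipped with Debian. record_manifest and verify_manifest are hypothetical helper names, not part of mdadm or any existing tool.

```shell
#!/bin/sh
# Sketch only: record and verify SHA-256 checksums of a directory tree.
# Assumes GNU coreutils (sha256sum, sort -z) and GNU findutils/xargs
# (find -print0, xargs -0 -r), as found on Debian.
# record_manifest/verify_manifest are hypothetical helper names.

# record_manifest DIR MANIFEST
# Hash every regular file below DIR into MANIFEST, with paths relative
# to DIR. MANIFEST should live outside DIR so it does not hash itself.
record_manifest() {
    # The redirection is opened before the subshell changes directory,
    # so a relative MANIFEST path is resolved from the caller's cwd.
    (cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum) > "$2"
}

# verify_manifest DIR MANIFEST
# Re-hash the files listed in MANIFEST relative to DIR. Prints only
# mismatching or unreadable files and returns non-zero if there are any.
verify_manifest() {
    # MANIFEST is fed on stdin for the same reason as above.
    (cd "$1" && sha256sum --quiet -c -) < "$2"
}
```

Usage (paths are only examples): run `record_manifest whatever /root/whatever.sha256` after step 2, then `verify_manifest /mnt/whatever /root/whatever.sha256` in steps 5 and 9. A non-zero exit status together with FAILED lines points at corrupted or unreadable files. Files added after the manifest was recorded are not checked.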

Thorsten

> 
> Thanks,
> Song
> 
> 
>>
>> On Wed, May 22, 2019 at 5:26 AM Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hello.
>>>
>>> BUG: RAID6 recovery broken by commit
>>> 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef (Linux 5.1.3+)
>>>
>>> Replacing a failed disk of an MD RAID6 array causes file system
>>> corruption and data loss on kernels containing commit
>>> 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef.
>>>
>>> Affected kernels: 5.1.3, 5.1.4, possibly others.
>>> Unaffected kernels: 5.1.2
>>>
>>> OS: Debian stretch amd64
>>>
>>> Steps to reproduce the BUG:
>>>
>>> 1. Create a new 4-disk RAID6 array, create a file system and mount it:
>>>    mdadm /dev/md0 --create -l 6 -n 4 /dev/sd[bcde]
>>>    mkfs.ext4 /dev/md0
>>>    mount /dev/md0 /mnt
>>> 2. Store some data (a few GB should be fine) on the RAID6 array's file system:
>>>    cp -r whatever /mnt
>>> 3. Fail a disk of the RAID6 array and remove it from the array:
>>>    mdadm /dev/md0 --fail /dev/sdd
>>>    mdadm /dev/md0 --remove /dev/sdd
>>> 4. Drop caches:
>>>    echo "3" > /proc/sys/vm/drop_caches
>>> 5. Compare data copied to the RAID6 array in step 2 with its source:
>>>    diff -r whatever /mnt/whatever
>>>    There should be no differences and no file system errors.
>>> 6. Add a new empty disk to the RAID6 array:
>>>    mdadm /dev/md0 --add /dev/sdf
>>> 7. RAID6 recovery should start now; wait for it to finish (progress is
>>>    shown in /proc/mdstat).
>>> 8. Drop caches again:
>>>    echo "3" > /proc/sys/vm/drop_caches
>>> 9. Compare data copied to the RAID6 array in step 2 with its source again:
>>>    diff -r whatever /mnt/whatever
>>>    diff now reports a lot of differences and the kernel log fills up with
>>>    file system errors, for example:
>>>    EXT4-fs warning (device md0): ext4_dirent_csum_verify:355: inode #918549: comm diff: No space for directory leaf checksum. Please run e2fsck -D.
>>>
>>> Reverting commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef from kernel
>>> 5.1.4 resolves the issues described above.
>>>
>>> Kind regards
>>> Thorsten
>>>
>>>
>>> --
>>> ___
>>>  |        | /                 E-Mail: linux@xxxxxxxxxxxxxxxxx
>>>  |horsten |/\nabe                WWW: http://linux.thorsten-knabe.de
>>>
