Re: [REGRESSION] Data read from a degraded RAID 4/5/6 array could be silently corrupted.

Xiao Ni <xni@xxxxxxxxxx> · Fri, 17 Nov 2023 17:08:28 +0800

Hi all,

I can reproduce this quickly in my environment with the commands
mentioned at https://www.spinics.net/lists/raid/msg73521.html.
With the patch below applied, the problem has not recurred after
several hours of testing. This patch works for me.

Tested-by: Xiao Ni <xni@xxxxxxxxxx>

On Fri, Nov 17, 2023 at 12:25 AM Song Liu <song@xxxxxxxxxx> wrote:
>
> + more folks.
>
> On Fri, Nov 10, 2023 at 7:00 PM Bhanu Victor DiCara
> <00bvd0+linux@xxxxxxxxx> wrote:
> >
> > A degraded RAID 4/5/6 array can sometimes read 0s instead of the actual data.
> >
> >
> > #regzbot introduced: 10764815ff4728d2c57da677cd5d3dd6f446cf5f
> > (The problem does not occur in the previous commit.)
> >
> > In commit 10764815ff4728d2c57da677cd5d3dd6f446cf5f, file drivers/md/raid5.c, line 5808, there is `md_account_bio(mddev, &bi);`. When this line (and the previous line) is removed, the problem does not occur.
>
> The patch below should fix it. Please test it more thoroughly and
> let me know whether it fixes everything. I will send a patch later
> with more details.
>
> Thanks,
> Song
>
> diff --git i/drivers/md/md.c w/drivers/md/md.c
> index 68f3bb6e89cb..d4fb1aa5c86f 100644
> --- i/drivers/md/md.c
> +++ w/drivers/md/md.c
> @@ -8674,7 +8674,8 @@ static void md_end_clone_io(struct bio *bio)
>         struct bio *orig_bio = md_io_clone->orig_bio;
>         struct mddev *mddev = md_io_clone->mddev;
>
> -       orig_bio->bi_status = bio->bi_status;
> +       if (bio->bi_status)
> +               orig_bio->bi_status = bio->bi_status;
>
>         if (md_io_clone->start_time)
>                 bio_end_io_acct(orig_bio, md_io_clone->start_time);
>
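
To make the effect of the one-line change concrete, here is a toy model
in Python of the two versions of the status copy. It is only a sketch,
not kernel code: the Bio class, the helper names and the value used for
BLK_STS_IOERR are stand-ins, and the scenario (the original bio already
holds an error when a clone completes successfully) is an assumption
for illustration. The thread does not say how that state arises.

```python
from dataclasses import dataclass

BLK_STS_OK = 0
BLK_STS_IOERR = 10  # stand-in: any nonzero value means "error" in this model


@dataclass
class Bio:
    """Toy stand-in for struct bio; only the status field is modelled."""
    bi_status: int = BLK_STS_OK


def end_clone_io_old(orig_bio: Bio, clone: Bio) -> None:
    # Pre-patch behaviour: always copy the clone's status.
    orig_bio.bi_status = clone.bi_status


def end_clone_io_new(orig_bio: Bio, clone: Bio) -> None:
    # Patched behaviour: copy only when the clone reports an error.
    if clone.bi_status:
        orig_bio.bi_status = clone.bi_status


def final_status(end_clone_io, orig_status: int, clone_status: int) -> int:
    """Complete one clone against an original bio and return its status."""
    orig_bio = Bio(orig_status)
    end_clone_io(orig_bio, Bio(clone_status))
    return orig_bio.bi_status
```

In this model, final_status(end_clone_io_old, BLK_STS_IOERR, BLK_STS_OK)
loses the error, while the same call with end_clone_io_new keeps it.
Both versions still propagate an error reported by the clone itself.
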
>
> >
> > Similarly, in commit ffc253263a1375a65fa6c9f62a893e9767fbebfa (v6.6), file drivers/md/raid5.c, when line 6200 is removed, the problem does not occur.
> >
> >
> > Steps to reproduce the problem (using bash or similar):
> > 1. Create a degraded RAID 4/5/6 array:
> > fallocate -l 2056M test_array_part_1.img
> > fallocate -l 2056M test_array_part_2.img
> > lo1=$(losetup --sector-size 4096 --find --nooverlap --direct-io --show  test_array_part_1.img)
> > lo2=$(losetup --sector-size 4096 --find --nooverlap --direct-io --show  test_array_part_2.img)
> > # The RAID level must be 4 or 5 or 6 with at least 1 missing drive in any order. The following configuration seems to be the most effective:
> > mdadm --create /dev/md/tmp_test_array --level=4 --raid-devices=3 --chunk=1M --size=2G  $lo1 missing $lo2
> >
> > 2. Create the test file system and clone it to the degraded array:
> > fallocate -l 4G test_fs.img
> > mke2fs -t ext4 -b 4096 -i 65536 -m 0 -E stride=256,stripe_width=512 -L test_fs  test_fs.img
> > lo3=$(losetup --sector-size 4096 --find --nooverlap --direct-io --show  test_fs.img)
> > mount $lo3 /mnt/1
> > python3 create_test_fs.py /mnt/1
> > umount /mnt/1
> > cat test_fs.img > /dev/md/tmp_test_array
> > cmp -l test_fs.img /dev/md/tmp_test_array  # Optionally verify the clone
> > mount --read-only $lo3 /mnt/1
> >
> > 3. Mount the degraded array:
> > mount --read-only /dev/md/tmp_test_array /mnt/2
> >
> > 4. Compare the files:
> > diff -q /mnt/1 /mnt/2
> >
> > If no files are detected as different, do `umount /mnt/2` and `echo 2 > /proc/sys/vm/drop_caches`, and then go to step 3.
> > (Doing `echo 3 > /proc/sys/vm/drop_caches` and then going to step 4 is less effective.)
> > (Only doing `umount /mnt/2` and/or `echo 1 > /proc/sys/vm/drop_caches` is much less effective and the effectiveness wears off.)
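
The retry loop over steps 3 and 4 can be automated. The following is a
minimal sketch, assuming the array and mount points named in the steps
above and root privileges. The helper names (run, drop_caches,
reproduce) and the max_rounds limit are hypothetical, not part of the
original report. It relies on diff exiting with status 1 when it finds
a difference.

```python
import subprocess

ARRAY = "/dev/md/tmp_test_array"  # degraded array from step 1
REFERENCE_DIR = "/mnt/1"          # read-only mount of test_fs.img
ARRAY_DIR = "/mnt/2"              # mount point for the degraded array


def run(cmd):
    """Run a command and return its exit status (hypothetical helper)."""
    return subprocess.run(cmd).returncode


def drop_caches(value):
    """Write value to /proc/sys/vm/drop_caches (needs root)."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write(value)


def reproduce(max_rounds, run=run, drop_caches=drop_caches):
    """Repeat steps 3 and 4; return the round that differed, or None."""
    for round_no in range(1, max_rounds + 1):
        # Step 3: mount the degraded array.
        if run(["mount", "--read-only", ARRAY, ARRAY_DIR]) != 0:
            raise RuntimeError("mount failed")
        # Step 4: compare the files; diff exits 0 (same), 1 (differ), 2 (error).
        status = run(["diff", "-q", REFERENCE_DIR, ARRAY_DIR])
        if status == 1:
            return round_no  # leave the array mounted for inspection
        if status != 0:
            raise RuntimeError("diff failed")
        # No difference found: unmount, drop caches, and go back to step 3.
        run(["umount", ARRAY_DIR])
        drop_caches("2")
    return None
```

Calling reproduce(100) as root runs up to 100 rounds and returns the
number of the first round in which diff reported a difference, or None
if none did. The run and drop_caches parameters exist only so the loop
can be exercised without touching real devices.
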
> >
> >
> > create_test_fs.py:
> > import errno
> > import itertools
> > import os
> > import random
> > import sys
> >
> >
> > def main(test_fs_path):
> >         rng = random.Random(0)
> >         try:
> >                 for i in itertools.count():
> >                         size = int(2**rng.uniform(12, 24))
> >                         with open(os.path.join(test_fs_path, str(i).zfill(4) + '.bin'), 'xb') as f:
> >                                 f.write(b'\xff' * size)
> >                         print(f'Created file {f.name!r} with size {size}')
> >         except OSError as e:
> >                 if e.errno != errno.ENOSPC:
> >                         raise
> >                 print(f'Done: {e.strerror} (partially created file {f.name!r})')
> >
> >
> > if __name__ == '__main__':
> >         main(sys.argv[1])
> >
> >
> >
>