Re: The read data is wrong from raid5 when recovery happens

Xiao Ni <xni@xxxxxxxxxx> · Fri, 26 May 2023 14:45:48 +0800

On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote:
>
>
>
> On 5/26/23 09:49, Xiao Ni wrote:
> > Hi all
> >
> > We found a problem recently. The read data is wrong when recovery happens.
> > Now we've found it's introduced by patch 10764815f (md: add io accounting
> > for raid0 and raid5). I can reproduce this 100%. This problem exists in
> > upstream. The test steps are like this:
> >
> > 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> > 2. mkfs.ext4 -F $devname
> > 3. mount $devname $mount_point
> > 4. mdadm --incremental --fail sdd
> > 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000
> > status=progress
> > 6. mdadm /dev/md126 --add /dev/sdd
> > 7. create 31 processes that writes and reads. It compares the content with
> > md5sum. The test will go on until the recovery stops
> > 8. wait for about 10 minutes, we can see some processes report checksum is
> > wrong. But if it re-read the data again, the checksum will be good.
> >
> > I tried to narrow this problem like this:
> >
> > -       md_account_bio(mddev, &bi);
> > +       if (rw == WRITE)
> > +               md_account_bio(mddev, &bi);
> > If it only do account for write requests, the problem can disappear.
> >
> > -       if (rw == READ && mddev->degraded == 0 &&
> > -           mddev->reshape_position == MaxSector) {
> > -               bi = chunk_aligned_read(mddev, bi);
> > -               if (!bi)
> > -                       return true;
> > -       }
> > +       //if (rw == READ && mddev->degraded == 0 &&
> > +       //    mddev->reshape_position == MaxSector) {
> > +       //      bi = chunk_aligned_read(mddev, bi);
> > +       //      if (!bi)
> > +       //              return true;
> > +       //}
> >
> >          if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >                  make_discard_request(mddev, bi);
> > @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> > struct bio * bi)
> >                          md_write_end(mddev);
> >                  return true;
> >          }
> > -       md_account_bio(mddev, &bi);
> > +       if (rw == READ)
> > +               md_account_bio(mddev, &bi);
> >
> > I comment the chunk_aligned_read out and only account for read requests,
> > this problem can be reproduced.
>
> After a quick look,raid5_read_one_chunk clones bio by itself, so no need to
> do it for the chunk aligned readcase. Could you pls try this?
>
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6120,6 +6120,7 @@static bool raid5_make_request(struct mddev *mddev,
> struct bio * bi)
>         const int rw = bio_data_dir(bi);
>         enum stripe_result res;
>         int s, stripe_cnt;
> +bool account_bio = true;
>
>         if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
>                 int ret = log_handle_flush_request(conf, bi);
> @@ -6148,6 +6149,7 @@static bool raid5_make_request(struct mddev *mddev,
> struct bio * bi)
>         if (rw == READ && mddev->degraded == 0 &&
>             mddev->reshape_position == MaxSector) {
>                 bi = chunk_aligned_read(mddev, bi);
> +account_bio = false;
>                 if (!bi)
>                         return true;
>         }
> @@ -6180,7 +6182,8 @@static bool raid5_make_request(struct mddev *mddev,
> struct bio * bi)
>                         md_write_end(mddev);
>                 return true;
>         }
> -       md_account_bio(mddev, &bi);
> +if (account_bio)
> +md_account_bio(mddev, &bi);
>
>
> Thanks,
> Guoqing
>

Hi Guoqing

When chunk_aligned_read runs successfully, it just returns. In this
case, it does the account by itself. If it fails to execute, it still
needs run md_account_bio in raid5_make_request. So now the logic
should be right.

And by the way, in the original email, as mentioned, I commented the
chunk aligned read codes out, and it still can be reproduced.

-- 
Best Regards
Xiao Ni