Re: The read data is wrong from raid5 when recovery happens

On Mon, May 29, 2023 at 4:34 PM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote:
>
>
>
> On 5/29/23 11:41, Xiao Ni wrote:
> > On Mon, May 29, 2023 at 10:27 AM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote:
> >>
> >>
> >> On 5/26/23 15:23, Xiao Ni wrote:
> >>>>>>> 6. mdadm /dev/md126 --add /dev/sdd
> >>>>>>> 7. create 31 processes that write and read. They compare the content
> >>>>>>> with md5sum. The test goes on until the recovery stops
> >>>> Could you share the test code/script for step 7? Will try it from my side.
> >>> The test scripts were written by people from Intel.
> >>> Hi, Mariusz. Can I share the test scripts here?
> >>>
> >>>>>>> 8. wait for about 10 minutes; we can see some processes report that the
> >>>>>>> checksum is wrong. But if they re-read the data, the checksum is good.
> >>>> So it is interim; I guess it appeared before recovery was finished.
> >>> Yes, it appears before recovery finishes. The test will finish once
> >>> the recovery finishes.
> >>>
> >>>>>>> I tried to narrow this problem down like this:
> >>>>>>>
> >>>>>>> -       md_account_bio(mddev, &bi);
> >>>>>>> +       if (rw == WRITE)
> >>>>>>> +               md_account_bio(mddev, &bi);
> >>>>>>> If it only does accounting for write requests, the problem disappears.
> >>>>>>>
> >>>>>>> -       if (rw == READ && mddev->degraded == 0 &&
> >>>>>>> -           mddev->reshape_position == MaxSector) {
> >>>>>>> -               bi = chunk_aligned_read(mddev, bi);
> >>>>>>> -               if (!bi)
> >>>>>>> -                       return true;
> >>>>>>> -       }
> >>>>>>> +       //if (rw == READ && mddev->degraded == 0 &&
> >>>>>>> +       //    mddev->reshape_position == MaxSector) {
> >>>>>>> +       //      bi = chunk_aligned_read(mddev, bi);
> >>>>>>> +       //      if (!bi)
> >>>>>>> +       //              return true;
> >>>>>>> +       //}
> >>>>>>>
> >>>>>>>             if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >>>>>>>                     make_discard_request(mddev, bi);
> >>>>>>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> >>>>>>> struct bio * bi)
> >>>>>>>                             md_write_end(mddev);
> >>>>>>>                     return true;
> >>>>>>>             }
> >>>>>>> -       md_account_bio(mddev, &bi);
> >>>>>>> +       if (rw == READ)
> >>>>>>> +               md_account_bio(mddev, &bi);
> >>>>>>>
> >>>>>>> If I comment chunk_aligned_read out and only account for read requests,
> >>>>>>> this problem can be reproduced.
> >> Only write bios and non-chunk-aligned read bios call md_account_bio, and
> >> accounting only write bios is fine per your test. It means the md5sum didn't
> >> match because of non-chunk-aligned read bios, so it is not abnormal that data
> >> in another chunk could be changed while the recovery is not finished, right?
> > That's right, only non-aligned read requests can cause this problem.
> > Good catch. If I understand right, you mean the non-aligned read
> > request reads data from a chunk which hasn't been recovered, right?
>
> Yes, I don't think comparing md5sum in such a scenario makes much sense
> given the state is interim. And it also appeared in my test with io
> accounting disabled.

It's the data that customers read that can be wrong, which is
dangerous because customers use that data directly. If that's true, it
looks like there is a race between non-aligned reads and the recovery
process.
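
To illustrate, here is a rough sketch of the kind of read/verify loop I
mean. This is only an illustration, not the Intel test scripts; it
assumes the array is mounted at /mnt/md126 and recovery is still running:

#!/bin/bash
# Rough illustration only -- not the Intel test scripts.
# Assumes /dev/md126 is mounted at /mnt/md126 and recovery is running.
MNT=/mnt/md126
F="$MNT/verify.$$"

# Write a test file and remember its checksum. Use a read/write size that
# is not a multiple of the chunk size, so some reads cross chunk
# boundaries and are not chunk aligned.
dd if=/dev/urandom of="$F" bs=100k count=40 oflag=direct 2>/dev/null
GOOD=$(dd if="$F" bs=100k iflag=direct 2>/dev/null | md5sum | awk '{print $1}')

# Keep re-reading with O_DIRECT while recovery is still running.
while grep -q recovery /proc/mdstat; do
    CUR=$(dd if="$F" bs=100k iflag=direct 2>/dev/null | md5sum | awk '{print $1}')
    if [ "$CUR" != "$GOOD" ]; then
        echo "mismatch at $(date): $CUR != $GOOD"
        # re-read once more; in the reported runs the second read is good
        dd if="$F" bs=100k iflag=direct 2>/dev/null | md5sum
    fi
done
rm -f "$F"

If the mismatch only ever shows up while /proc/mdstat still reports
recovery, and the immediate re-read is always good, that would fit the
race theory.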

Regards
Xiao

>
> >> BTW, I ran the test with bio accounting disabled by default, and it
> >> seems the result is the same.
> >>
> >>> git tag  --sort=taggerdate --contain 10764815f |head -1
> >> v5.14-rc1
> >>
> >> localhost:~/readdata #uname -r
> >> 5.15.0-rc4-59.24-default
> >> localhost:~/readdata #cat /sys/block/md126/queue/iostats
> >> 0
> >>
> >> And I can still see the relevant log from the terminal which runs 01-test.sh
> > Hmm, thanks for this. I'll try again. What kind of disks do
> > you use for testing?
>
> Four SCSI disks (1G capacity) inside a VM.
>
> Thanks,
> Guoqing
>




