Re: The read data is wrong from raid5 when recovery happens

Xiao Ni <xni@xxxxxxxxxx> · Fri, 26 May 2023 15:23:58 +0800

On Fri, May 26, 2023 at 3:12 PM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote:
>
>
>
> On 5/26/23 14:45, Xiao Ni wrote:
> > On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote:
> >>
> >>
> >> On 5/26/23 09:49, Xiao Ni wrote:
> >>> Hi all
> >>>
> >>> We found a problem recently. The read data is wrong when recovery happens.
> >>> Now we've found it's introduced by patch 10764815f (md: add io accounting
> >>> for raid0 and raid5). I can reproduce this 100%. This problem exists in
> >>> upstream. The test steps are like this:
> >>>
> >>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
> >>> 2. mkfs.ext4 -F $devname
> >>> 3. mount $devname $mount_point
> >>> 4. mdadm --incremental --fail sdd
> >>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress
>
> I suppose /tmp is the mount point.

/tmp/pythontest is the mount point

>
> >>> 6. mdadm /dev/md126 --add /dev/sdd
> >>> 7. create 31 processes that writes and reads. It compares the content with
> >>> md5sum. The test will go on until the recovery stops
>
> Could you share the test code/script for step 7? Will try it from my side.

The test scripts are written by people from intel.
Hi, Mariusz. Can I share the test scripts here?

>
> >>> 8. wait for about 10 minutes, we can see some processes report checksum is
> >>> wrong. But if it re-read the data again, the checksum will be good.
>
> So it is interim, I guess it appeared before recover was finished.

Yes, it appears before recovery finishes. The test will finish once
the recovery finishes.

>
> >>> I tried to narrow this problem like this:
> >>>
> >>> -       md_account_bio(mddev, &bi);
> >>> +       if (rw == WRITE)
> >>> +               md_account_bio(mddev, &bi);
> >>> If it only do account for write requests, the problem can disappear.
> >>>
> >>> -       if (rw == READ && mddev->degraded == 0 &&
> >>> -           mddev->reshape_position == MaxSector) {
> >>> -               bi = chunk_aligned_read(mddev, bi);
> >>> -               if (!bi)
> >>> -                       return true;
> >>> -       }
> >>> +       //if (rw == READ && mddev->degraded == 0 &&
> >>> +       //    mddev->reshape_position == MaxSector) {
> >>> +       //      bi = chunk_aligned_read(mddev, bi);
> >>> +       //      if (!bi)
> >>> +       //              return true;
> >>> +       //}
> >>>
> >>>           if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
> >>>                   make_discard_request(mddev, bi);
> >>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
> >>> struct bio * bi)
> >>>                           md_write_end(mddev);
> >>>                   return true;
> >>>           }
> >>> -       md_account_bio(mddev, &bi);
> >>> +       if (rw == READ)
> >>> +               md_account_bio(mddev, &bi);
> >>>
> >>> I comment the chunk_aligned_read out and only account for read requests,
> >>> this problem can be reproduced.
> >> After a quick look,raid5_read_one_chunk clones bio by itself, so no need to
> >> do it for the chunk aligned readcase. Could you pls try this?
>
> [...]
>
> >> Hi Guoqing
> >>
> >> When chunk_aligned_read runs successfully, it just returns. In this
> >> case, it does the account by itself. If it fails to execute, it still
> >> needs run md_account_bio in raid5_make_request. So now the logic
> >> should be right.
>
> Hmm, I was confused.

Which part?
>
> Thanks,
> Guoqing
>

-- 
Best Regards
Xiao Ni