Re: The read data is wrong from raid5 when recovery happens

Guoqing Jiang <guoqing.jiang@xxxxxxxxx> · Fri, 26 May 2023 11:09:24 +0800

On 5/26/23 09:49, Xiao Ni wrote:
Hi all

We found a problem recently. The read data is wrong when recovery happens.
Now we've found it's introduced by patch 10764815f (md: add io accounting
for raid0 and raid5). I can reproduce this 100%. This problem exists in
upstream. The test steps are like this:

1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean
2. mkfs.ext4 -F $devname
3. mount $devname $mount_point
4. mdadm --incremental --fail sdd
5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000
status=progress
6. mdadm /dev/md126 --add /dev/sdd
7. create 31 processes that writes and reads. It compares the content with
md5sum. The test will go on until the recovery stops
8. wait for about 10 minutes, we can see some processes report checksum is
wrong. But if it re-read the data again, the checksum will be good.

I tried to narrow this problem like this:

-       md_account_bio(mddev, &bi);
+       if (rw == WRITE)
+               md_account_bio(mddev, &bi);
If it only do account for write requests, the problem can disappear.

-       if (rw == READ && mddev->degraded == 0 &&
-           mddev->reshape_position == MaxSector) {
-               bi = chunk_aligned_read(mddev, bi);
-               if (!bi)
-                       return true;
-       }
+       //if (rw == READ && mddev->degraded == 0 &&
+       //    mddev->reshape_position == MaxSector) {
+       //      bi = chunk_aligned_read(mddev, bi);
+       //      if (!bi)
+       //              return true;
+       //}

         if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
                 make_discard_request(mddev, bi);
@@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev,
struct bio * bi)
                         md_write_end(mddev);
                 return true;
         }
-       md_account_bio(mddev, &bi);
+       if (rw == READ)
+               md_account_bio(mddev, &bi);

I comment the chunk_aligned_read out and only account for read requests,
this problem can be reproduced.

After a quick look,raid5_read_one_chunk clones bio by itself, so no need to
do it for the chunk aligned readcase. Could you pls try this?

--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6120,6 +6120,7 @@static bool raid5_make_request(struct mddev *mddev, 
struct bio * bi)
       const int rw = bio_data_dir(bi);
       enum stripe_result res;
       int s, stripe_cnt;
+bool account_bio = true;

       if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
               int ret = log_handle_flush_request(conf, bi);
@@ -6148,6 +6149,7 @@static bool raid5_make_request(struct mddev *mddev, 
struct bio * bi)
       if (rw == READ && mddev->degraded == 0 &&
           mddev->reshape_position == MaxSector) {
               bi = chunk_aligned_read(mddev, bi);
+account_bio = false;
               if (!bi)
                       return true;
       }
@@ -6180,7 +6182,8 @@static bool raid5_make_request(struct mddev *mddev, 
struct bio * bi)
                       md_write_end(mddev);
               return true;
       }
-       md_account_bio(mddev, &bi);
+if (account_bio)
+md_account_bio(mddev, &bi);


Thanks,
Guoqing