On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote: > > > > On 5/26/23 09:49, Xiao Ni wrote: > > Hi all > > > > We found a problem recently. The read data is wrong when recovery happens. > > Now we've found it's introduced by patch 10764815f (md: add io accounting > > for raid0 and raid5). I can reproduce this 100%. This problem exists in > > upstream. The test steps are like this: > > > > 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean > > 2. mkfs.ext4 -F $devname > > 3. mount $devname $mount_point > > 4. mdadm --incremental --fail sdd > > 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 > > status=progress > > 6. mdadm /dev/md126 --add /dev/sdd > > 7. create 31 processes that writes and reads. It compares the content with > > md5sum. The test will go on until the recovery stops > > 8. wait for about 10 minutes, we can see some processes report checksum is > > wrong. But if it re-read the data again, the checksum will be good. > > > > I tried to narrow this problem like this: > > > > - md_account_bio(mddev, &bi); > > + if (rw == WRITE) > > + md_account_bio(mddev, &bi); > > If it only do account for write requests, the problem can disappear. > > > > - if (rw == READ && mddev->degraded == 0 && > > - mddev->reshape_position == MaxSector) { > > - bi = chunk_aligned_read(mddev, bi); > > - if (!bi) > > - return true; > > - } > > + //if (rw == READ && mddev->degraded == 0 && > > + // mddev->reshape_position == MaxSector) { > > + // bi = chunk_aligned_read(mddev, bi); > > + // if (!bi) > > + // return true; > > + //} > > > > if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) { > > make_discard_request(mddev, bi); > > @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev, > > struct bio * bi) > > md_write_end(mddev); > > return true; > > } > > - md_account_bio(mddev, &bi); > > + if (rw == READ) > > + md_account_bio(mddev, &bi); > > > > I comment the chunk_aligned_read out and only account for read requests, > > this problem can be reproduced. > > After a quick look,raid5_read_one_chunk clones bio by itself, so no need to > do it for the chunk aligned readcase. Could you pls try this? > > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -6120,6 +6120,7 @@static bool raid5_make_request(struct mddev *mddev, > struct bio * bi) > const int rw = bio_data_dir(bi); > enum stripe_result res; > int s, stripe_cnt; > +bool account_bio = true; > > if (unlikely(bi->bi_opf & REQ_PREFLUSH)) { > int ret = log_handle_flush_request(conf, bi); > @@ -6148,6 +6149,7 @@static bool raid5_make_request(struct mddev *mddev, > struct bio * bi) > if (rw == READ && mddev->degraded == 0 && > mddev->reshape_position == MaxSector) { > bi = chunk_aligned_read(mddev, bi); > +account_bio = false; > if (!bi) > return true; > } > @@ -6180,7 +6182,8 @@static bool raid5_make_request(struct mddev *mddev, > struct bio * bi) > md_write_end(mddev); > return true; > } > - md_account_bio(mddev, &bi); > +if (account_bio) > +md_account_bio(mddev, &bi); > > > Thanks, > Guoqing > Hi Guoqing When chunk_aligned_read runs successfully, it just returns. In this case, it does the account by itself. If it fails to execute, it still needs run md_account_bio in raid5_make_request. So now the logic should be right. And by the way, in the original email, as mentioned, I commented the chunk aligned read codes out, and it still can be reproduced. -- Best Regards Xiao Ni