On Fri, May 26, 2023 at 3:12 PM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote: > > > > On 5/26/23 14:45, Xiao Ni wrote: > > On Fri, May 26, 2023 at 11:09 AM Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote: > >> > >> > >> On 5/26/23 09:49, Xiao Ni wrote: > >>> Hi all > >>> > >>> We found a problem recently. The read data is wrong when recovery happens. > >>> Now we've found it's introduced by patch 10764815f (md: add io accounting > >>> for raid0 and raid5). I can reproduce this 100%. This problem exists in > >>> upstream. The test steps are like this: > >>> > >>> 1. mdadm -CR $devname -l5 -n4 /dev/sd[b-e] --force --assume-clean > >>> 2. mkfs.ext4 -F $devname > >>> 3. mount $devname $mount_point > >>> 4. mdadm --incremental --fail sdd > >>> 5. dd if=/dev/zero of=/tmp/pythontest/file1 bs=1M count=100000 status=progress > > I suppose /tmp is the mount point. /tmp/pythontest is the mount point > > >>> 6. mdadm /dev/md126 --add /dev/sdd > >>> 7. create 31 processes that writes and reads. It compares the content with > >>> md5sum. The test will go on until the recovery stops > > Could you share the test code/script for step 7? Will try it from my side. The test scripts are written by people from intel. Hi, Mariusz. Can I share the test scripts here? > > >>> 8. wait for about 10 minutes, we can see some processes report checksum is > >>> wrong. But if it re-read the data again, the checksum will be good. > > So it is interim, I guess it appeared before recover was finished. Yes, it appears before recovery finishes. The test will finish once the recovery finishes. > > >>> I tried to narrow this problem like this: > >>> > >>> - md_account_bio(mddev, &bi); > >>> + if (rw == WRITE) > >>> + md_account_bio(mddev, &bi); > >>> If it only do account for write requests, the problem can disappear. > >>> > >>> - if (rw == READ && mddev->degraded == 0 && > >>> - mddev->reshape_position == MaxSector) { > >>> - bi = chunk_aligned_read(mddev, bi); > >>> - if (!bi) > >>> - return true; > >>> - } > >>> + //if (rw == READ && mddev->degraded == 0 && > >>> + // mddev->reshape_position == MaxSector) { > >>> + // bi = chunk_aligned_read(mddev, bi); > >>> + // if (!bi) > >>> + // return true; > >>> + //} > >>> > >>> if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) { > >>> make_discard_request(mddev, bi); > >>> @@ -6180,7 +6180,8 @@ static bool raid5_make_request(struct mddev *mddev, > >>> struct bio * bi) > >>> md_write_end(mddev); > >>> return true; > >>> } > >>> - md_account_bio(mddev, &bi); > >>> + if (rw == READ) > >>> + md_account_bio(mddev, &bi); > >>> > >>> I comment the chunk_aligned_read out and only account for read requests, > >>> this problem can be reproduced. > >> After a quick look,raid5_read_one_chunk clones bio by itself, so no need to > >> do it for the chunk aligned readcase. Could you pls try this? > > [...] > > >> Hi Guoqing > >> > >> When chunk_aligned_read runs successfully, it just returns. In this > >> case, it does the account by itself. If it fails to execute, it still > >> needs run md_account_bio in raid5_make_request. So now the logic > >> should be right. > > Hmm, I was confused. Which part? > > Thanks, > Guoqing > -- Best Regards Xiao Ni