On Sun, Jul 19, 2020 at 10:38 PM Coly Li <colyli@xxxxxxx> wrote:
>
> On 2020/7/20 13:26, Coly Li wrote:
> > On 2020/7/20 12:58, Xiao Ni wrote:
> >> Now the discard request is split by chunk size, so it takes a long time
> >> to finish mkfs on disks which support the discard function. This patch set
> >> improves the handling of raid10 discard requests. It uses a similar
> >> approach to raid0 (29efc390b).
> >>
> >> But it's a little more complex than raid0, because raid10 has a different
> >> layout. If raid10 uses the offset layout and the discard request is smaller
> >> than the stripe size, there are holes when we submit discard bios to the
> >> underlying disks.
> >>
> >> For example: five disks (disk1 - disk5)
> >> D01 D02 D03 D04 D05
> >> D05 D01 D02 D03 D04
> >> D06 D07 D08 D09 D10
> >> D10 D06 D07 D08 D09
> >> The discard bio just wants to discard from D03 to D10. For disk3, there is
> >> a hole between D03 and D08. For disk4, there is a hole between D04 and D09.
> >> D03 is one chunk, and raid10_write_request can handle one chunk perfectly.
> >> So the part that is not aligned with the stripe size is still handled by
> >> raid10_write_request.
> >>
> >> If a reshape is running when the discard bio arrives and the bio spans the
> >> reshape position, raid10_write_request is responsible for handling this
> >> discard bio.
> >>
> >> I did a test with this patch set.
> >> Without the patch:
> >> time mkfs.xfs /dev/md0
> >> real    4m39.775s
> >> user    0m0.000s
> >> sys     0m0.298s
> >>
> >> With the patch:
> >> time mkfs.xfs /dev/md0
> >> real    0m0.105s
> >> user    0m0.000s
> >> sys     0m0.007s
> >>
> >> nvme3n1     259:1    0  477G  0 disk
> >> └─nvme3n1p1 259:10   0   50G  0 part
> >> nvme4n1     259:2    0  477G  0 disk
> >> └─nvme4n1p1 259:11   0   50G  0 part
> >> nvme5n1     259:6    0  477G  0 disk
> >> └─nvme5n1p1 259:12   0   50G  0 part
> >> nvme2n1     259:9    0  477G  0 disk
> >> └─nvme2n1p1 259:15   0   50G  0 part
> >> nvme0n1     259:13   0  477G  0 disk
> >> └─nvme0n1p1 259:14   0   50G  0 part
> >>
> >> v1:
> >> Coly helped to review these patches and gave some suggestions:
> >> One bug was found: if the discard bio crosses one stripe but the bio size
> >> is bigger than one stripe size, the bio will be NULL after splitting. In
> >> this version, it checks whether the discard bio size is bigger than
> >> 2*stripe_size.
> >> In raid10_end_discard_request, it's better to check whether R10BIO_Uptodate
> >> is set or not. This avoids a memory write and improves performance.
> >> Added more comments for calculating addresses.
> >>
> >> Signed-off-by: Xiao Ni <xni@xxxxxxxxxx>
> >
> > The code looks good to me, you may add my Reviewed-by: Coly Li
> > <colyli@xxxxxxx>
> >
> > But I still suggest you add a more detailed commit log to explain
> > how the discard range of each component disk is decided. Anyway, this is
> > just a suggestion; it is up to you.
> >
> > Thank you for this work.
>
> In raid10_end_discard_request(), in the first 5 lines of local variable
> declarations, there are 3 lines that start with improper blanks (they
> should be tab '\t'). You can find this minor issue with checkpatch.pl,
> I guess.
>
> After fixing this format issue, you still have my Reviewed-by :-)
>
> Coly Li

Thanks to Xiao for the patch, and thanks to Coly for the review.
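Side note: while reading the cover letter I wrote a tiny userspace sketch to
convince myself of the address math it describes. This is only an illustration
for a plain near layout (near_copies = 2, no far/offset copies, no reshape);
the chunk size, disk count and discard range below are made-up numbers, and
none of this is the actual patch code:

/*
 * Rough userspace sketch of the idea in the cover letter, NOT the md code:
 * for a simple RAID10 "near" layout, the stripe-aligned middle of a discard
 * range maps to one contiguous per-column range, mirrored on both copies of
 * each column. The unaligned head/tail are left to the normal write path.
 * Assumes the range covers at least one full stripe.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t chunk   = 1024;           /* chunk size in sectors (512 KiB), assumed */
        uint64_t disks   = 4;              /* raid_disks, assumed */
        uint64_t copies  = 2;              /* near_copies */
        uint64_t columns = disks / copies; /* independent data columns */
        uint64_t stripe  = chunk * columns;/* data sectors per stripe */

        uint64_t start = 1500, end = 100000; /* discard range in array sectors, assumed */

        /* split off the stripe-unaligned head and tail */
        uint64_t mid_start = (start + stripe - 1) / stripe * stripe;
        uint64_t mid_end   = end / stripe * stripe;

        printf("head %llu..%llu and tail %llu..%llu go via the regular write path\n",
               (unsigned long long)start, (unsigned long long)mid_start,
               (unsigned long long)mid_end, (unsigned long long)end);

        /* the aligned middle becomes one contiguous discard per column (and mirror) */
        uint64_t dev_start = mid_start / stripe * chunk;
        uint64_t dev_len   = (mid_end - mid_start) / columns;
        printf("each column (both mirrors) gets one discard: sector %llu, %llu sectors\n",
               (unsigned long long)dev_start, (unsigned long long)dev_len);
        return 0;
}

The point is just that the stripe-aligned middle collapses to a single
contiguous discard per component disk, which is where the big mkfs speedup
comes from.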
I did an experiment with a patch to increase max_discard_sectors, like:

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 14b1ba732cd7d..0b15397ebfaf9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3762,7 +3762,7 @@ static int raid10_run(struct mddev *mddev)
        chunk_size = mddev->chunk_sectors << 9;
        if (mddev->queue) {
                blk_queue_max_discard_sectors(mddev->queue,
-                                             mddev->chunk_sectors);
+                                             UINT_MAX);
                blk_queue_max_write_same_sectors(mddev->queue, 0);
                blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
                blk_queue_io_min(mddev->queue, chunk_size);

This simple change reduces the run time of mkfs.xfs on my SSD from 5+ minutes
to about 14 seconds. Of course, this is not as good as the ~2 seconds with the
set. But I wonder whether the time saving justifies the extra complexity.
Thoughts on this?

Also, as Coly mentioned, please run checkpatch.pl on the patches. I have seen
warnings on all 3 patches in the set.

Thanks,
Song
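P.S. For anyone curious about the magnitude, here is a quick back-of-the-envelope
on how many discard bios the block layer ends up issuing when the limit is one
chunk versus effectively unlimited. The array size and chunk size below are
assumptions for illustration, not measurements from my test box:

/* Quick back-of-the-envelope: how many discard bios the block layer has to
 * issue for one big discard, depending on max_discard_sectors.
 * The sizes are assumptions for illustration only. */
#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void)
{
        uint64_t array_sectors = 100ULL << 21;   /* ~100 GiB array in 512-byte sectors */
        uint64_t chunk_sectors = 1024;           /* 512 KiB chunk */

        /* a big discard is split into ceil(len / max_discard_sectors) bios */
        uint64_t capped   = (array_sectors + chunk_sectors - 1) / chunk_sectors;
        uint64_t uncapped = (array_sectors + UINT_MAX - 1) / UINT_MAX;

        printf("limit = chunk_sectors: %llu discard bios\n", (unsigned long long)capped);
        printf("limit = UINT_MAX:      %llu discard bios\n", (unsigned long long)uncapped);
        return 0;
}

That difference in bio count is roughly why even the one-liner already helps
so much.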