On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote: > On 2016/12/8 上午12:59, Shaohua Li wrote: > > On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote: > [snip] > > Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0 > > zones make things a little complicated though. I just had a brief look of your > > proposed patch, which looks really complicated. I'd suggest something like > > this: > > 1. split the bio according to zone boundary. > > 2. handle the splitted bio. since the bio is within zone range, calculating > > the start and end sector for each rdev should be easy. > > > > Hi Shaohua, > > Thanks for your suggestion! I try to modify the code by your suggestion, > it is even more hard to make the code that way ... > > Because even split bios for each zone, all the corner cases still exist > and should be taken care in every zoon. The code will be more complicated. Not sure why it makes the code more complicated. Probably I'm wrong, but Just want to make sure we are in the same page: split the bio according to zone boundary, then handle the splitted bio separately. Calculating end/start point of each rdev for the new bio within a zone should be simple. we then clone a bio for each rdev and dispatch. So for example: Disk 0: D0 D2 D4 D6 D7 Disk 1: D1 D3 D5 zone 0 is from D0 - D5, zone 1 is from D6 - D7 If bio is from D1 to D7, we split it to 2 bios, one is D1 - D5, the other D6 - D7. For D1 - D5, we dispatch 2 bios. D1 - D5 for disk 1, D2 - D4 for disk 0 For D6 - D7, we just dispatch to disk 0. What kind of corner case makes this more complicated? > > This will create slightly more bio to each rdev (not too many, since there > > aren't too many zones in practice) and block layer should easily merge these > > bios without much overhead. The benefit is a much simpler implementation. > > > >> I compose a prototype patch, the code is not simple, indeed it is quite > >> complicated IMHO. > >> > >> I do a little research, some NVMe SSDs support whole device size > >> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD > >> bio to block layer. But raid0_make_request() only receives 512KB size > >> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the > >> original large bio into 512KB small bios, the limitation is from > >> q->limits.discard_granularity. > > > > please adjust the max discard sectors for the queue. The original setting is > > chunk size. > > This is a powerful suggestion, I change the max_discard_sectors to raid0 > size, and fix some bugs, now the patch looks working well. The > performance number is not bad. > > On 4x3TB NVMe raid0, format it with mkfs.xfs. Current upstream kernel > spends 306 seconds, the patched kernel spends 15 seconds. I see average > request size increases from 1 chunk (1024 sectors) to 2048 chunks > (2097152 sectors). > > I don't know why the bios are still be split before raid0_make_request() > receives them, after I set q->limits.max_discard_sectors to > mddev->array_sectors. Can anybody give me a hint ? That probably is from disk_stack_limits. try set the max_discard_sectors after it. > Here I attach the RFC v2 patch, if anybody wants to try it, please do it > and response the result :-) > > I will take time to write a very detailed commit log and code comments > to make this patch more easier to be understood. Ugly code, that's what > I have to pay to gain better performance .... Can't say I like it :). Hard to read and the memory allocation is ugly. Please check if there is simpler solution first before writting detailed commit log. Thanks, Shaohua -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html