On 2016/12/7 下午7:50, Coly Li wrote: > On 2016/11/30 上午6:45, Avi Kivity wrote: >> On 11/29/2016 11:14 PM, NeilBrown wrote: > [snip] > >>>> So I disagree that all the work should be pushed to the merging layer. >>>> It has less information to work with, so the fewer decisions it has to >>>> make, the better. >>> I think that the merging layer should be as efficient as it reasonably >>> can be, and particularly should take into account plugging. This >>> benefits all callers. >> >> Yes, but plugging does not mean "please merge anything you can until the >> unplug". >> >>> If it can be demonstrated that changes to some of the upper layers bring >>> further improvements with acceptable costs, then certainly it is good to >>> have those too. >> >> Generating millions of requests only to merge them again is >> inefficient. It happens in an edge case (TRIM of the entirety of a very >> large RAID), but it already caused on user to believe the system >> failed. I think the system should be more robust than that. > > Neil, > > As my understand, if a large discard bio received by > raid0_make_request(), for example it requests to discard chunk 1 to 24 > on a raid0 device built by 4 SSDs. This large discard bio will be split > and written to each SSD as the following layout, > > SSD1: C1,C5,C9,C13,C17,C21 > SSD2: C2,C6,C10,C14,C18,C22 > SSD3: C3,C7,C11,C15,C19,C23 > SSD4: C4,C8,C12,C16,C20,C24 > > Current raid0 code will call generic_make_request() for 24 times for > each split bio. But it is possible to calculate the final layout of each > split bio, so we can combine all the bios into four per-SSD large bio, > like this, > > bio1 (on SSD1): C{1,5,9,13,17,21} > bio2 (on SSD2): C{2,6,10,14,18,22} > bio3 (on SSD3): C{3,7,11,15,19,23} > bio4 (on SSD4): C{4,8,12,16,20,24} > > Now we only need to call generic_make_request() for 4 times. Rebuild the > per-device discard bios is more efficient in raid0 code then in block > layer. There are some reasons that I know, > - there are splice timeout, block layer cannot merge all split bio into > one large bio before time out. > - rebuilt per-device bio in raid0 is just by a few calculation, block > layer does merge on queue with list operations, it is slower. > - raid0 code knows its on disk layout, so rebuild per-device bio is > possible here. block layer has no idea on raid0 layout, it can only do > request merge. > > Avi, > > I compose a prototype patch, the code is not simple, indeed it is quite > complicated IMHO. > > I do a little research, some NVMe SSDs support whole device size > DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD > bio to block layer. But raid0_make_request() only receives 512KB size > DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the > original large bio into 512KB small bios, the limitation is from > q->limits.discard_granularity. > > At this moment, I don't know why a q->limits.discard_granularity is > 512KB even the underlying SSD supports whole device size discard. We > also need to fix q->limits.discard_granularity, otherwise > block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop > to split the original large discard bio into smaller ones and sends them > to raid0 code by next_bio(). > > I also CC this email to linux-block@xxxxxxxxxxxxxxx to ask for help. My > question is, if a NVMe SSD supports whole-device-size DISCARD, is > q->limits.discard_granularity still necessary ? > > Here I also attach my prototype patch as a proof of concept, it is > runnable with Linux 4.9-rc7. Aha! It is not limits.discard_granularity, it is limits.max_discard_sectors. Which is set in raid0.c:raid0_run() by blk_queue_max_discard_sectors(). Here limits.maxZ_discard_sectors is set to raid0 chunk size. Interesting .... And, linux-block@xxxxxxxxxxxxxxx, please ignore my noise. Coly -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html