On 2016/11/30 上午6:45, Avi Kivity wrote: > On 11/29/2016 11:14 PM, NeilBrown wrote: [snip] >>> So I disagree that all the work should be pushed to the merging layer. >>> It has less information to work with, so the fewer decisions it has to >>> make, the better. >> I think that the merging layer should be as efficient as it reasonably >> can be, and particularly should take into account plugging. This >> benefits all callers. > > Yes, but plugging does not mean "please merge anything you can until the > unplug". > >> If it can be demonstrated that changes to some of the upper layers bring >> further improvements with acceptable costs, then certainly it is good to >> have those too. > > Generating millions of requests only to merge them again is > inefficient. It happens in an edge case (TRIM of the entirety of a very > large RAID), but it already caused on user to believe the system > failed. I think the system should be more robust than that. Neil, As my understand, if a large discard bio received by raid0_make_request(), for example it requests to discard chunk 1 to 24 on a raid0 device built by 4 SSDs. This large discard bio will be split and written to each SSD as the following layout, SSD1: C1,C5,C9,C13,C17,C21 SSD2: C2,C6,C10,C14,C18,C22 SSD3: C3,C7,C11,C15,C19,C23 SSD4: C4,C8,C12,C16,C20,C24 Current raid0 code will call generic_make_request() for 24 times for each split bio. But it is possible to calculate the final layout of each split bio, so we can combine all the bios into four per-SSD large bio, like this, bio1 (on SSD1): C{1,5,9,13,17,21} bio2 (on SSD2): C{2,6,10,14,18,22} bio3 (on SSD3): C{3,7,11,15,19,23} bio4 (on SSD4): C{4,8,12,16,20,24} Now we only need to call generic_make_request() for 4 times. Rebuild the per-device discard bios is more efficient in raid0 code then in block layer. There are some reasons that I know, - there are splice timeout, block layer cannot merge all split bio into one large bio before time out. - rebuilt per-device bio in raid0 is just by a few calculation, block layer does merge on queue with list operations, it is slower. - raid0 code knows its on disk layout, so rebuild per-device bio is possible here. block layer has no idea on raid0 layout, it can only do request merge. Avi, I compose a prototype patch, the code is not simple, indeed it is quite complicated IMHO. I do a little research, some NVMe SSDs support whole device size DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD bio to block layer. But raid0_make_request() only receives 512KB size DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the original large bio into 512KB small bios, the limitation is from q->limits.discard_granularity. At this moment, I don't know why a q->limits.discard_granularity is 512KB even the underlying SSD supports whole device size discard. We also need to fix q->limits.discard_granularity, otherwise block/blk-lib.c:__blkdev_issue_discard() still does an inefficient loop to split the original large discard bio into smaller ones and sends them to raid0 code by next_bio(). I also CC this email to linux-block@xxxxxxxxxxxxxxx to ask for help. My question is, if a NVMe SSD supports whole-device-size DISCARD, is q->limits.discard_granularity still necessary ? Here I also attach my prototype patch as a proof of concept, it is runnable with Linux 4.9-rc7. Coly
Subject: optimization for large size DISCARD bio by per-device bios This is a very early prototype, still needs more block layer code modification to make it work. Current upstream raid0_make_request() only handles TRIM/DISCARD bio by chunk size, it meams for large raid0 device built by SSDs will call million times generic_make_request() for the split bio. This patch tries to combine small bios into large one if they are on same real device and continuous on this real device, then send the combined large bio to underlying device by single call to generic_make_request(). For example, use mkfs.xfs to trim a raid0 device built with 4 x 3TB NVMeSSD, current upstream raid0_make_request() will call generic_make_request() 5.7 million times, with this patch only 4 calls to generic_make_request() is required. This patch won't work in real world, because in block/blk-lib.c: __blkdev_issue_discard() the original large bio will be split into smaller ones by restriction of discard_granularity. If some day SSD supports whole device sized discard_granularity, it will be very interesting then... The basic idea is, if a large discard bio received by raid0_make_request(), for example it requests to discard chunk 1 to 24 on a raid0 device built by 4 SSDs. This large discard bio will be split and written to each SSD as the following layout, SSD1: C1,C5,C9,C13,C17,C21 SSD2: C2,C6,C10,C14,C18,C22 SSD3: C3,C7,C11,C15,C19,C23 SSD4: C4,C8,C12,C16,C20,C24 Current raid0 code will call generic_make_request() for 24 times for each split bio. But it is possible to calculate the final layout of each split bio, so we can combine all the bios into four per-SSD large bio, like this, bio1 (on SSD1): C{1,5,9,13,17,21} bio2 (on SSD2): C{2,6,10,14,18,22} bio3 (on SSD3): C{3,7,11,15,19,23} bio4 (on SSD4): C{4,8,12,16,20,24} Now we only need to call generic_make_request() for 4 times. The code is not simple, I need more time to write text to complain how it works. Currently you can treat it as a proof of concept. Signed-off-by: Coly Li <colyli@xxxxxxx> --- drivers/md/raid0.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 224 insertions(+) diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index 258986a..73c5fac 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -452,6 +452,225 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev, } } + +struct bio_record { + sector_t bi_sector; + unsigned long sectors; + struct md_rdev *rdev; +}; + +static void handle_discard_request(struct mddev *mddev, struct bio *bio) +{ + struct bio_record *recs = NULL; + struct bio *split; + struct r0conf *conf = mddev->private; + sector_t sectors, sector; + struct strip_zone *first_zone; + int zone_idx; + sector_t zone_start, zone_end; + int nr_strip_zones = conf->nr_strip_zones; + int disks; + int first_rdev_idx = -1, rdev_idx; + struct md_rdev *first_rdev; + unsigned int chunk_sects = mddev->chunk_sectors; + + sector = bio->bi_iter.bi_sector; + first_zone = find_zone(conf, §or); + first_rdev = map_sector(mddev, first_zone, sector, §or); + + sectors = chunk_sects - + (likely(is_power_of_2(chunk_sects)) + ? (sector & (chunk_sects - 1)) + : sector_div(sector, chunk_sects)); + + /* if bio size is not exceed a chunk boundary, + * simply handle it here. + */ + if (sectors >= bio_sectors(bio)) { + bio->bi_bdev = first_rdev->bdev; + bio->bi_iter.bi_sector = sector + first_zone->dev_start + + first_rdev->data_offset; + if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) + /* Just ignore it */ + bio_endio(bio); + else + generic_make_request(bio); + return; + } + + /* bio is large enough to be split, allocate recs firstly */ + disks = mddev->raid_disks; + recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO); + if (recs == NULL) { + printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \ + "for bio_record", mdname(mddev)); + bio->bi_error = -ENOMEM; + bio_endio(bio); + return; + } + + zone_idx = first_zone - conf->strip_zone; + for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) { + struct md_rdev *rdev; + + rdev = conf->devlist[zone_idx * disks + rdev_idx]; + recs[rdev_idx].rdev = rdev; + if (rdev == first_rdev) + first_rdev_idx = rdev_idx; + } + + /* Restore due to sector_div */ + sector = bio->bi_iter.bi_sector; + recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start; + recs[first_rdev_idx].sectors = sectors; + BUG_ON(recs[first_rdev_idx].rdev != first_rdev); + + /* recs[first_rdev_idx] is initialized with 'sectors', we need to + * handle the rested sectors, which is sotred in 'sectors' too. + */ + sectors = bio_sectors(bio) - sectors; + + /* bio may not be chunk size aligned, the split bio on first rdev + * may not be chunk size aligned too. But the rested split bios + * on rested rdevs must be chunk size aligned, and aligned to + * round down chunk number. + */ + zone_end = first_zone->zone_end; + rdev_idx = first_rdev_idx + 1; + sector = likely(is_power_of_2(chunk_sects)) + ? sector & (~(chunk_sects - 1)) + : chunk_sects * (sector/chunk_sects); + + while (rdev_idx < first_zone->nb_dev) { + recs[rdev_idx].bi_sector = sector + first_zone->dev_start; + if (sectors <= chunk_sects) { + recs[rdev_idx].sectors = sectors; + goto issue; + } + recs[rdev_idx].sectors = chunk_sects; + sectors -= chunk_sects; + rdev_idx++; + } + + sector += chunk_sects; + zone_start = sector + first_zone->dev_start; + if (zone_start == zone_end) { + zone_idx++; + zone_start = conf->strip_zone[zone_idx].dev_start; + } + + while (zone_idx < nr_strip_zones) { + int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev; + int chunks_per_rdev, rested_chunks, rested_sectors; + sector_t zone_sectors, grow_sectors; + int add_rested_sectors = 0; + + zone_end = conf->strip_zone[zone_idx].zone_end; + zone_sectors = zone_end - zone_start; + chunks_per_rdev = sectors; + rested_sectors = + sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone); + rested_chunks = rested_sectors; + rested_sectors = sector_div(rested_chunks, chunk_sects); + + if ((chunks_per_rdev * chunk_sects) > zone_sectors) + chunks_per_rdev = zone_sectors/chunk_sects; + + /* rested_chunks and rested_sectors go into next zone, we won't + * handle them in this zone. Set them to 0. + */ + if ((chunks_per_rdev * chunk_sects) == zone_sectors && + (rested_chunks != 0 || rested_sectors != 0)) { + if (rested_chunks != 0) + rested_chunks = 0; + if (rested_sectors != 0) + rested_sectors = 0; + } + + if (rested_chunks == 0 && rested_sectors != 0) + add_rested_sectors = 1; + + for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) { + /* if .sectors is not initailized (== 0), it indicates + * .bi_sector is not initialized neither. We initiate + * .bi_sector firstly, then set .sectors by + * grow_sectors. + */ + if (recs[rdev_idx].sectors == 0) + recs[rdev_idx].bi_sector = zone_start; + grow_sectors = chunks_per_rdev * chunk_sects; + if (rested_chunks) { + grow_sectors += chunk_sects; + rested_chunks--; + if (rested_chunks == 0 && + rested_sectors != 0) { + recs[rdev_idx].sectors += grow_sectors; + sectors -= grow_sectors; + add_rested_sectors = 1; + continue; + } + } + + /* if add_rested_sectors != 0, it indicates + * rested_sectors != 0 + */ + if (add_rested_sectors) + grow_sectors += rested_sectors; + recs[rdev_idx].sectors += grow_sectors; + sectors -= grow_sectors; + if (add_rested_sectors) + break; + } + + if (sectors == 0) + break; + zone_start = zone_end; + zone_idx++; + BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start); + } + + +issue: + /* recs contains the re-ordered requests layout, now we can + * chain split bios from recs + */ + for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) { + if (rdev_idx == first_rdev_idx || + recs[rdev_idx].sectors == 0) + continue; + split = bio_split(bio, + recs[rdev_idx].sectors, + GFP_NOIO, + fs_bio_set); + if (split == NULL) + break; + bio_chain(split, bio); + BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9); + split->bi_bdev = recs[rdev_idx].rdev->bdev; + split->bi_iter.bi_sector = recs[rdev_idx].bi_sector + + recs[rdev_idx].rdev->data_offset; + + if (unlikely(!blk_queue_discard( + bdev_get_queue(split->bi_bdev)))) + /* Just ignore it */ + bio_endio(split); + else + generic_make_request(split); + } + BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9); + bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector + + recs[first_rdev_idx].rdev->data_offset; + bio->bi_bdev = recs[first_rdev_idx].rdev->bdev; + + if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) + /* Just ignore it */ + bio_endio(bio); + else + generic_make_request(bio); + + kfree(recs); +} + static void raid0_make_request(struct mddev *mddev, struct bio *bio) { struct strip_zone *zone; @@ -463,6 +682,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio) return; } + if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) { + handle_discard_request(mddev, bio); + return; + } + do { sector_t sector = bio->bi_iter.bi_sector; unsigned chunk_sects = mddev->chunk_sectors;