On 2016/12/8 12:59 AM, Shaohua Li wrote:
> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
[snip]
> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
> zones make things a little complicated though. I just had a brief look of your
> proposed patch, which looks really complicated. I'd suggest something like
> this:
> 1. split the bio according to zone boundary.
> 2. handle the splitted bio. since the bio is within zone range, calculating
> the start and end sector for each rdev should be easy.
>

Hi Shaohua,

Thanks for your suggestion! I tried to modify the code the way you suggest, but it turns out to be even harder to write it that way. Even if the bio is split per zone, all the corner cases still exist and have to be handled in every zone, so the code ends up more complicated.

> This will create slightly more bio to each rdev (not too many, since there
> aren't too many zones in practice) and block layer should easily merge these
> bios without much overhead. The benefit is a much simpler implementation.
>
>> I compose a prototype patch, the code is not simple, indeed it is quite
>> complicated IMHO.
>>
>> I do a little research, some NVMe SSDs support whole device size
>> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
>> bio to block layer. But raid0_make_request() only receives 512KB size
>> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
>> original large bio into 512KB small bios, the limitation is from
>> q->limits.discard_granularity.
>
> please adjust the max discard sectors for the queue. The original setting is
> chunk size.

This is a powerful suggestion. I changed max_discard_sectors to the raid0 device size and fixed some bugs, and now the patch looks to be working well.

The performance numbers are not bad. On a 4 x 3TB NVMe raid0 formatted with mkfs.xfs, the current upstream kernel spends 306 seconds and the patched kernel spends 15 seconds. I see the average request size increase from 1 chunk (1024 sectors) to 2048 chunks (2097152 sectors).

I don't know why the bios are still split before raid0_make_request() receives them, even after I set q->limits.max_discard_sectors to mddev->array_sectors. Can anybody give me a hint?

Here I attach the RFC v2 patch. If anybody wants to try it, please do so and report the results :-) I will take time to write a very detailed commit log and code comments to make this patch easier to understand. Ugly code, that's what I have to pay to gain better performance ...

Thanks in advance.

Coly
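For reference, here is a back-of-the-envelope model of why raising the queue's discard limit matters so much. The range size below is made up and this is only an illustration, not the blk-lib code:

/*
 * Toy model of the bio counts above: a discard range is chopped into
 * bios no larger than the queue's discard limit, so the number of bios
 * reaching raid0_make_request() is roughly range_size / limit.
 * Illustrative numbers only -- this is not the blk-lib code.
 */
#include <stdio.h>

static unsigned long long nr_bios(unsigned long long range_sects,
				  unsigned long long limit_sects)
{
	return (range_sects + limit_sects - 1) / limit_sects;
}

int main(void)
{
	unsigned long long range = 1ULL << 31;	/* say, 1TiB in 512B sectors */

	printf("limit = 1 chunk (1024 sectors): %llu bios\n",
	       nr_bios(range, 1024));
	printf("limit = whole range           : %llu bios\n",
	       nr_bios(range, range));
	/* the observed 2048-chunk average request is 2048 * 1024 sectors */
	printf("average request after patch   : %llu MiB\n",
	       2048ULL * 1024 * 512 >> 20);
	return 0;
}

With the per-chunk limit, a single trim of a large range turns into millions of bios before md ever sees it; with the limit raised to the array size, raid0 at least gets a chance to do the combining itself.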
Subject: [RFC v2] optimization for large size DISCARD bio by per-device bios

This is a very early prototype; it still needs more block layer code
modification to make it work.

The current upstream raid0_make_request() only handles TRIM/DISCARD bios
chunk by chunk, which means that discarding a large raid0 device built
from SSDs calls generic_make_request() millions of times for the split
bios. This patch tries to combine small bios into a large one if they
are on the same real device and contiguous on that device, then sends
the combined large bio to the underlying device with a single call to
generic_make_request(). For example, when mkfs.xfs trims a raid0 device
built with 4 x 3TB NVMe SSDs, the current upstream raid0_make_request()
calls generic_make_request() 5.7 million times; with this patch only 4
calls to generic_make_request() are required.

This patch cannot reach its full effect in the real world yet, because
in block/blk-lib.c:__blkdev_issue_discard() the original large bio is
still split into smaller ones, restricted by discard_granularity. If
some day SSDs support whole-device-sized discard_granularity, it will
be very interesting then...

The basic idea is: when a large discard bio is received by
raid0_make_request(), say a request to discard chunks 1 to 24 on a
raid0 device built from 4 SSDs, the large discard bio would be split
and written to each SSD with the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
The current raid0 code calls generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of
each split bio in advance, so we can combine all the bios into four
per-SSD large bios, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() 4 times.

The code is not simple; I need more time to write text to explain how
it works. Currently you can treat it as a proof of concept.

Changelog
v1, Initial prototype.
v2, Major changes include,
 - Rename functions: handle_discard_bio() now takes care of chunk-size
   DISCARD bios and the single disk situation; large DISCARD bios are
   handled in handle_large_discard_bio().
 - Set max_discard_sectors to the raid0 device size.
 - Fix several bugs found in basic testing.
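To make the combining idea concrete, here is a small userspace sketch of the per-device math for the example above (single zone, 4 members, 1024-sector chunks). The numbers are illustrative only and this is not the kernel code:

/*
 * Chunk c (1-based) lives on member (c - 1) % 4 at device offset
 * ((c - 1) / 4) * chunk_sects, so consecutive chunks on one member are
 * contiguous and can be merged into a single discard per member.
 */
#include <stdio.h>

#define NR_DEVS		4
#define CHUNK_SECTS	1024ULL

int main(void)
{
	unsigned long long start[NR_DEVS] = {0}, len[NR_DEVS] = {0};
	unsigned long long c;
	int d;

	for (c = 1; c <= 24; c++) {			/* discard chunks 1..24 */
		d = (c - 1) % NR_DEVS;			/* which member */
		if (len[d] == 0)			/* first chunk on it */
			start[d] = ((c - 1) / NR_DEVS) * CHUNK_SECTS;
		len[d] += CHUNK_SECTS;			/* grows contiguously */
	}

	for (d = 0; d < NR_DEVS; d++)			/* one bio per member */
		printf("SSD%d: sector %llu, %llu sectors\n",
		       d + 1, start[d], len[d]);
	return 0;
}

The patch below does the same accounting per strip_zone with struct bio_record, then issues one combined bio per rdev.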
Signed-off-by: Coly Li <colyli@xxxxxxx>
---
 drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 266 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..c7afe0c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev)
 		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
-		blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+		blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0));
 
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		blk_queue_io_opt(mddev->queue,
 				 (mddev->chunk_sectors << 9) * mddev->raid_disks);
@@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
+
+struct bio_record {
+	sector_t bi_sector;
+	unsigned long sectors;
+	struct md_rdev *rdev;
+};
+
+static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+		       "for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	sectors = chunk_sects -
+		  (likely(is_power_of_2(chunk_sects))
+		  ? (sector & (chunk_sects - 1))
+		  : sector_div(sector, chunk_sects));
+	sector = bio->bi_iter.bi_sector;
+
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+
+	/* recs[first_rdev_idx] is initialized with 'sectors', we need to
+	 * handle the remaining sectors, which are stored in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* bio may not be chunk size aligned, so the split bio on the first
+	 * rdev may not be chunk size aligned either. But the remaining split
+	 * bios on the other rdevs must be chunk size aligned, and aligned to
+	 * a rounded-down chunk number.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		 ? sector & (~(chunk_sects - 1))
+		 : chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		if (recs[rdev_idx].sectors == 0) {
+			recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+			if (sectors <= chunk_sects) {
+				recs[rdev_idx].sectors = sectors;
+				goto issue;
+			}
+			recs[rdev_idx].sectors = chunk_sects;
+			sectors -= chunk_sects;
+		}
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		if (zone_idx == nr_strip_zones) {
+			if (sectors != 0)
+				printk(KERN_INFO "bio size exceeds raid0 " \
+				       "capability, ignore extra " \
+				       "TRIM/DISCARD range.\n");
+			goto issue;
+		}
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors go into the next zone, we
+		 * won't handle them in this zone. Set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors++;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* if .sectors is not initialized (== 0), it indicates
+			 * .bi_sector is not initialized either. We initialize
+			 * .bi_sector first, then set .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors++;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors == 1) {
+				grow_sectors += rested_sectors;
+				add_rested_sectors++;
+			}
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (sectors == 0)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		if (zone_idx < nr_strip_zones)
+			BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs contains the re-ordered request layout, now we can
+	 * chain split bios from recs
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+					   recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+				 recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
+static void handle_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct r0conf *conf = mddev->private;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+	sector_t sector, sectors;
+	struct md_rdev *rdev;
+	struct strip_zone *zone;
+
+	sector = bio->bi_iter.bi_sector;
+	zone = find_zone(conf, &sector);
+	rdev = map_sector(mddev, zone, sector, &sector);
+	bio->bi_bdev = rdev->bdev;
+	sectors = chunk_sects -
+		  (likely(is_power_of_2(chunk_sects))
+		  ? (sector & (chunk_sects - 1))
+		  : sector_div(sector, chunk_sects));
+
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		bio->bi_iter.bi_sector = sector + zone->dev_start +
+					 rdev->data_offset;
+		goto single_bio;
+	}
+
+	if (unlikely(zone->nb_dev == 1)) {
+		sectors = conf->strip_zone[0].zone_end - sector;
+		if (bio_sectors(bio) > sectors)
+			bio->bi_iter.bi_size = sectors << 9;
+		bio->bi_iter.bi_sector = sector + rdev->data_offset;
+		goto single_bio;
+	}
+
+	handle_large_discard_bio(mddev, bio);
+	return;
+
+single_bio:
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	return;
+}
+
+
 
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_bio(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;