On 2016/12/8 12:59 AM, Shaohua Li wrote:
> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
[snip]
> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
> zones make things a little complicated though. I just had a brief look of your
> proposed patch, which looks really complicated. I'd suggest something like
> this:
> 1. split the bio according to zone boundary.
> 2. handle the splitted bio. since the bio is within zone range, calculating
> the start and end sector for each rdev should be easy.
>

Hi Shaohua,

Thanks for your suggestion! I tried to modify the code the way you suggest, but it turns out to be even harder to write it that way. Even if the bio is split per zone, all the corner cases still exist and have to be handled in every zone, so the code ends up more complicated.

> This will create slightly more bio to each rdev (not too many, since there
> aren't too many zones in practice) and block layer should easily merge these
> bios without much overhead. The benefit is a much simpler implementation.
>
>> I compose a prototype patch, the code is not simple, indeed it is quite
>> complicated IMHO.
>>
>> I do a little research, some NVMe SSDs support whole device size
>> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
>> bio to block layer. But raid0_make_request() only receives 512KB size
>> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
>> original large bio into 512KB small bios, the limitation is from
>> q->limits.discard_granularity.
>
> please adjust the max discard sectors for the queue. The original setting is
> chunk size.

This is a powerful suggestion. I changed max_discard_sectors to the raid0 device size and fixed some bugs, and now the patch looks to be working well.

The performance numbers are not bad. On a 4 x 3TB NVMe raid0 formatted with mkfs.xfs, the current upstream kernel spends 306 seconds and the patched kernel spends 15 seconds. I see the average request size increase from 1 chunk (1024 sectors) to 2048 chunks (2097152 sectors).

I don't know why the bios are still split before raid0_make_request() receives them, even after I set q->limits.max_discard_sectors to mddev->array_sectors. Can anybody give me a hint?

Here I attach the RFC v2 patch. If anybody wants to try it, please do so and report the results :-) I will take time to write a very detailed commit log and code comments to make this patch easier to understand. Ugly code, that's what I have to pay to gain better performance ...

Thanks in advance.

Coly
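For reference, here is a back-of-the-envelope model of why raising the queue's discard limit matters so much. The range size below is made up and this is only an illustration, not the blk-lib code:

/*
 * Toy model of the bio counts above: a discard range is chopped into
 * bios no larger than the queue's discard limit, so the number of bios
 * reaching raid0_make_request() is roughly range_size / limit.
 * Illustrative numbers only -- this is not the blk-lib code.
 */
#include <stdio.h>

static unsigned long long nr_bios(unsigned long long range_sects,
				  unsigned long long limit_sects)
{
	return (range_sects + limit_sects - 1) / limit_sects;
}

int main(void)
{
	unsigned long long range = 1ULL << 31;	/* say, 1TiB in 512B sectors */

	printf("limit = 1 chunk (1024 sectors): %llu bios\n",
	       nr_bios(range, 1024));
	printf("limit = whole range           : %llu bios\n",
	       nr_bios(range, range));
	/* the observed 2048-chunk average request is 2048 * 1024 sectors */
	printf("average request after patch   : %llu MiB\n",
	       2048ULL * 1024 * 512 >> 20);
	return 0;
}

With the per-chunk limit, a single trim of a large range turns into millions of bios before md ever sees it; with the limit raised to the array size, raid0 at least gets a chance to do the combining itself.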
Subject: [RFC v2] optimization for large size DISCARD bio by per-device bios

This is a very early prototype; it still needs more block layer code
modification to make it work.

The current upstream raid0_make_request() only handles TRIM/DISCARD bios
chunk by chunk, which means that discarding a large raid0 device built
from SSDs calls generic_make_request() millions of times for the split
bios. This patch tries to combine small bios into a large one if they
are on the same real device and contiguous on that device, then sends
the combined large bio to the underlying device with a single call to
generic_make_request(). For example, when mkfs.xfs trims a raid0 device
built with 4 x 3TB NVMe SSDs, the current upstream raid0_make_request()
calls generic_make_request() 5.7 million times; with this patch only 4
calls to generic_make_request() are required.

This patch cannot reach its full effect in the real world yet, because
in block/blk-lib.c:__blkdev_issue_discard() the original large bio is
still split into smaller ones, restricted by discard_granularity. If
some day SSDs support whole-device-sized discard_granularity, it will
be very interesting then...

The basic idea is: when a large discard bio is received by
raid0_make_request(), say a request to discard chunks 1 to 24 on a
raid0 device built from 4 SSDs, the large discard bio would be split
and written to each SSD with the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
The current raid0 code calls generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of
each split bio in advance, so we can combine all the bios into four
per-SSD large bios, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() 4 times.

The code is not simple; I need more time to write text to explain how
it works. Currently you can treat it as a proof of concept.

Changelog
v1, Initial prototype.
v2, Major changes include,
 - Rename functions: handle_discard_bio() now takes care of chunk-size
   DISCARD bios and the single disk situation; large DISCARD bios are
   handled in handle_large_discard_bio().
 - Set max_discard_sectors to the raid0 device size.
 - Fix several bugs found in basic testing.
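To make the combining idea concrete, here is a small userspace sketch of the per-device math for the example above (single zone, 4 members, 1024-sector chunks). The numbers are illustrative only and this is not the kernel code:

/*
 * Chunk c (1-based) lives on member (c - 1) % 4 at device offset
 * ((c - 1) / 4) * chunk_sects, so consecutive chunks on one member are
 * contiguous and can be merged into a single discard per member.
 */
#include <stdio.h>

#define NR_DEVS		4
#define CHUNK_SECTS	1024ULL

int main(void)
{
	unsigned long long start[NR_DEVS] = {0}, len[NR_DEVS] = {0};
	unsigned long long c;
	int d;

	for (c = 1; c <= 24; c++) {			/* discard chunks 1..24 */
		d = (c - 1) % NR_DEVS;			/* which member */
		if (len[d] == 0)			/* first chunk on it */
			start[d] = ((c - 1) / NR_DEVS) * CHUNK_SECTS;
		len[d] += CHUNK_SECTS;			/* grows contiguously */
	}

	for (d = 0; d < NR_DEVS; d++)			/* one bio per member */
		printf("SSD%d: sector %llu, %llu sectors\n",
		       d + 1, start[d], len[d]);
	return 0;
}

The patch below does the same accounting per strip_zone with struct bio_record, then issues one combined bio per rdev.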
Signed-off-by: Coly Li <colyli@xxxxxxx>
---
 drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 266 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..c7afe0c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev)
 		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
-		blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+		blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0));
 
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		blk_queue_io_opt(mddev->queue,
 				 (mddev->chunk_sectors << 9) * mddev->raid_disks);
@@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
+
+struct bio_record {
+	sector_t bi_sector;
+	unsigned long sectors;
+	struct md_rdev *rdev;
+};
+
+static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+		       "for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	sectors = chunk_sects -
+		  (likely(is_power_of_2(chunk_sects))
+		  ? (sector & (chunk_sects - 1))
+		  : sector_div(sector, chunk_sects));
+	sector = bio->bi_iter.bi_sector;
+
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+
+	/* recs[first_rdev_idx] is initialized with 'sectors', we need to
+	 * handle the remaining sectors, which are stored in 'sectors' too.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* bio may not be chunk size aligned, so the split bio on the first
+	 * rdev may not be chunk size aligned either. But the remaining split
+	 * bios on the other rdevs must be chunk size aligned, and aligned to
+	 * a rounded-down chunk number.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		 ? sector & (~(chunk_sects - 1))
+		 : chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		if (recs[rdev_idx].sectors == 0) {
+			recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+			if (sectors <= chunk_sects) {
+				recs[rdev_idx].sectors = sectors;
+				goto issue;
+			}
+			recs[rdev_idx].sectors = chunk_sects;
+			sectors -= chunk_sects;
+		}
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		if (zone_idx == nr_strip_zones) {
+			if (sectors != 0)
+				printk(KERN_INFO "bio size exceeds raid0 " \
+				       "capability, ignore extra " \
+				       "TRIM/DISCARD range.\n");
+			goto issue;
+		}
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors go into the next zone, we
+		 * won't handle them in this zone. Set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors++;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* if .sectors is not initialized (== 0), it indicates
+			 * .bi_sector is not initialized either. We initialize
+			 * .bi_sector first, then set .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors++;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors == 1) {
+				grow_sectors += rested_sectors;
+				add_rested_sectors++;
+			}
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (sectors == 0)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		if (zone_idx < nr_strip_zones)
+			BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs contains the re-ordered request layout, now we can
+	 * chain split bios from recs
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+					   recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+				 recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
+static void handle_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct r0conf *conf = mddev->private;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+	sector_t sector, sectors;
+	struct md_rdev *rdev;
+	struct strip_zone *zone;
+
+	sector = bio->bi_iter.bi_sector;
+	zone = find_zone(conf, &sector);
+	rdev = map_sector(mddev, zone, sector, &sector);
+	bio->bi_bdev = rdev->bdev;
+	sectors = chunk_sects -
+		  (likely(is_power_of_2(chunk_sects))
+		  ? (sector & (chunk_sects - 1))
+		  : sector_div(sector, chunk_sects));
+
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		bio->bi_iter.bi_sector = sector + zone->dev_start +
+					 rdev->data_offset;
+		goto single_bio;
+	}
+
+	if (unlikely(zone->nb_dev == 1)) {
+		sectors = conf->strip_zone[0].zone_end - sector;
+		if (bio_sectors(bio) > sectors)
+			bio->bi_iter.bi_size = sectors << 9;
+		bio->bi_iter.bi_sector = sector + rdev->data_offset;
+		goto single_bio;
+	}
+
+	handle_large_discard_bio(mddev, bio);
+	return;
+
+single_bio:
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	return;
+}
+
+
 
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_bio(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;