Re: Reason for md raid 01 blksize limited to 4 KiB?

On Wed, 30 May 2012 15:03:16 +0200 Sebastian Riemer
<sebastian.riemer@xxxxxxxxxxxxxxxx> wrote:

> On 29/05/12 12:25, NeilBrown wrote:
> > On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
> > <sebastian.riemer@xxxxxxxxxxxxxxxx> wrote:
> >> Now, I've updated mdadm to version 3.2.5 and it works as you've
> >> described. Thanks for your help! But the buffered IO is what matters;
> >> 4k isn't enough there. Please let me know about changes that increase
> >> the IO size for buffered IO. I'll have a look at this, too.
> > 
> > I don't know.  I'd have to dive into the code and look around and put a few
> > printks in to see what is happening.
> 
> Now, I've configured a storage server with real HDDs for testing the
> cached IO with kernel 3.4. Here, direct IO doesn't work at all
> (Input/Output error with dd/fio), and cached IO is extremely slow. My
> RAID0 devices are md100 and md200. The RAID1 on top of them is md300.
> 
> The md100 is reported as a "faulty spare", and this has hit the
> following kernel bug.
> 
> This is the debug output:
> 
> md/raid0:md100: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 541312 320
> md/raid1:md300: Disk failure on md100, disabling device.
> md/raid1:md300: Operation continuing on 1 devices.
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 0, wo:1, o:0, dev:md100
> disk 1, wo:0, o:1, dev:md200
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 1, wo:0, o:1, dev:md200
> md/raid0:md200: make_request bug: can't convert block across chunks or
> bigger than 512k 2704000 320
> 
> The chunk size of 320 KiB comes from max_sectors_kb of the LSI HW RAID
> controller, where the drives are passed through as single-drive RAID0
> logical devices. I guess this is a problem for the MD RAID0 underneath
> the RAID1, because it doesn't fit evenly into the 512 KiB stripe size.

Hmmm... that's bad.  Looks like I have a bug .... yes I do.  Patch below
fixes it.  If you could test and confirm, I would appreciate it.

As for the cached writes always being 4K - are you writing through a
filesystem or directly to /dev/md300?

If the former, it is a bug in that filesystem.
If the latter, it is a bug in fs/block_dev.c.
In particular, fs/block_dev.c uses "generic_writepages" for the
"writepages" method rather than "mpage_writepages" (or a wrapper which
calls it with appropriate args).
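
For reference, the wiring in fs/block_dev.c looks roughly like this
(sketched from memory of the 3.4 sources, so the neighbouring ops and
exact field order may differ):

static const struct address_space_operations def_blk_aops = {
	.readpage	= blkdev_readpage,
	.writepage	= blkdev_writepage,
	.write_begin	= blkdev_write_begin,
	.write_end	= blkdev_write_end,
	.writepages	= generic_writepages,	/* one bio per dirty page */
	.direct_IO	= blkdev_direct_IO,
};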

'generic_writepages' simply calls ->writepage on each dirty page.
mpage_writepages (used e.g. by ext2) collects multiple pages into
a single bio.
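
A filesystem that wants bigger bios wires up a small wrapper instead,
roughly like this (a sketch of the ext2-style pattern; ext2_get_block is
ext2's own get_block helper, declared in its private header):

#include <linux/fs.h>
#include <linux/mpage.h>
#include <linux/writeback.h>

static int ext2_writepages(struct address_space *mapping,
			   struct writeback_control *wbc)
{
	/* mpage_writepages gathers consecutive dirty pages into
	 * multi-page bios instead of issuing one bio per page. */
	return mpage_writepages(mapping, wbc, ext2_get_block);
}

/* ...and in the aops table:  .writepages = ext2_writepages */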

The elevator at the device level should still collect these 1-page bios into
larger requests, but I guess that has higher CPU overhead.

thanks for the report.

NeilBrown

From dd47a247ae226896205f753ad246cd40141aadf1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@xxxxxxx>
Date: Thu, 31 May 2012 15:39:11 +1000
Subject: [PATCH] md: raid1/raid10: fix problem with merge_bvec_fn

The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.

However, we were only setting that when a device was hot-added,
not when a device was present from the start.

This bug was introduced in 3.4, so the patch is suitable for 3.4.y
kernels.

Cc: stable@xxxxxxxxxxxxxxx
Reported-by: Sebastian Riemer <sebastian.riemer@xxxxxxxxxxxxxxxx>
Signed-off-by: NeilBrown <neilb@xxxxxxx>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 15dd59b..d7e9577 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2548,6 +2548,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	err = -EINVAL;
 	spin_lock_init(&conf->device_lock);
 	rdev_for_each(rdev, mddev) {
+		struct request_queue *q;
 		int disk_idx = rdev->raid_disk;
 		if (disk_idx >= mddev->raid_disks
 		    || disk_idx < 0)
@@ -2560,6 +2561,9 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		if (disk->rdev)
 			goto abort;
 		disk->rdev = rdev;
+		q = bdev_get_queue(rdev->bdev);
+		if (q->merge_bvec_fn)
+			mddev->merge_check_needed = 1;
 
 		disk->head_position = 0;
 	}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3f91c2e..d037adb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3311,7 +3311,7 @@ static int run(struct mddev *mddev)
 				 (conf->raid_disks / conf->near_copies));
 
 	rdev_for_each(rdev, mddev) {
-
+		struct request_queue *q;
 		disk_idx = rdev->raid_disk;
 		if (disk_idx >= conf->raid_disks
 		    || disk_idx < 0)
@@ -3327,6 +3327,9 @@ static int run(struct mddev *mddev)
 				goto out_free_conf;
 			disk->rdev = rdev;
 		}
+		q = bdev_get_queue(rdev->bdev);
+		if (q->merge_bvec_fn)
+			mddev->merge_check_needed = 1;
 
 		disk_stack_limits(mddev->gendisk, rdev->bdev,
 				  rdev->data_offset << 9);
