On Wed, 30 May 2012 15:03:16 +0200 Sebastian Riemer
<sebastian.riemer@xxxxxxxxxxxxxxxx> wrote:

> On 29/05/12 12:25, NeilBrown wrote:
> > On Tue, 29 May 2012 11:30:27 +0200 Sebastian Riemer
> > <sebastian.riemer@xxxxxxxxxxxxxxxx> wrote:
> >> Now, I've updated mdadm to version 3.2.5 and it works like you've
> >> described it. Thanks for your help! But the buffered IO is what
> >> matters. 4k isn't enough there. Please inform me about changes
> >> which increase the size in buffered IO. I'll have a look at this,
> >> too.
> >
> > I don't know.  I'd have to dive into the code and look around and
> > put a few printks in to see what is happening.
>
> Now I've configured a storage server with real HDDs for testing the
> cached IO with kernel 3.4. Here direct IO never works (Input/Output
> error with dd/fio), and cached IO is very slow. My RAID0 devices are
> md100 and md200; the RAID1 on top is md300.
>
> md100 is reported as a "faulty spare", and this hits the following
> kernel bug.
>
> This is the debug output:
>
> md/raid0:md100: make_request bug: can't convert block across chunks
> or bigger than 512k 541312 320
> md/raid0:md200: make_request bug: can't convert block across chunks
> or bigger than 512k 541312 320
> md/raid1:md300: Disk failure on md100, disabling device.
> md/raid1:md300: Operation continuing on 1 devices.
> RAID1 conf printout:
>  --- wd:1 rd:2
>  disk 0, wo:1, o:0, dev:md100
>  disk 1, wo:0, o:1, dev:md200
> RAID1 conf printout:
>  --- wd:1 rd:2
>  disk 1, wo:0, o:1, dev:md200
> md/raid0:md200: make_request bug: can't convert block across chunks
> or bigger than 512k 2704000 320
>
> The 320 KiB IO size comes from max_sectors_kb of the LSI HW RAID
> controller, where the drives are passed through as single-drive RAID0
> logical devices. I guess this is a problem for MD RAID0 underneath
> the RAID1, because it doesn't fit as a multiple of the 512 KiB stripe
> size.

Hmmm... that's bad. Looks like I have a bug .... yes I do.

Patch below fixes it. If you could test and confirm, I would
appreciate it.

As for the cached writes always being 4K - are you writing through a
filesystem, or directly to /dev/md300?

If the former, it is a bug in that filesystem. If the latter, it is a
bug in fs/block_dev.c.

In particular, fs/block_dev.c uses "generic_writepages" for the
"writepages" method rather than "mpage_writepages" (or a wrapper which
calls it with appropriate args).

'generic_writepages' simply calls ->writepage on each dirty page.
mpage_writepages (used e.g. by ext2) collects multiple pages into a
single bio. The elevator at the device level should still collect
these one-page bios into larger requests, but I guess that has higher
CPU overhead.
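Something like this untested sketch is the sort of wrapper I mean -
blkdev_get_block() already exists (static) in fs/block_dev.c, though
the blkdev_writepages() name here is made up for illustration:

static int blkdev_writepages(struct address_space *mapping,
			     struct writeback_control *wbc)
{
	/*
	 * mpage_writepages() batches runs of contiguous dirty pages
	 * into one large bio per run, whereas generic_writepages()
	 * calls ->writepage() once per page and so submits
	 * single-page (4K) bios.
	 */
	return mpage_writepages(mapping, wbc, blkdev_get_block);
}

with ".writepages = blkdev_writepages," replacing the current
generic_writepages entry in def_blk_aops.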
thanks for the report.

NeilBrown

From dd47a247ae226896205f753ad246cd40141aadf1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@xxxxxxx>
Date: Thu, 31 May 2012 15:39:11 +1000
Subject: [PATCH] md: raid1/raid10: fix problem with merge_bvec_fn

The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.

However we were only setting that when a device was hot-added,
not when a device was present from the start.

This bug was introduced in 3.4 so the patch is suitable for
3.4.y kernels.

Cc: stable@xxxxxxxxxxxxxxx
Reported-by: Sebastian Riemer <sebastian.riemer@xxxxxxxxxxxxxxxx>
Signed-off-by: NeilBrown <neilb@xxxxxxx>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 15dd59b..d7e9577 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2548,6 +2548,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	err = -EINVAL;
 	spin_lock_init(&conf->device_lock);
 	rdev_for_each(rdev, mddev) {
+		struct request_queue *q;
 		int disk_idx = rdev->raid_disk;
 		if (disk_idx >= mddev->raid_disks
 		    || disk_idx < 0)
@@ -2560,6 +2561,9 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		if (disk->rdev)
 			goto abort;
 		disk->rdev = rdev;
+		q = bdev_get_queue(rdev->bdev);
+		if (q->merge_bvec_fn)
+			mddev->merge_check_needed = 1;
 
 		disk->head_position = 0;
 	}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3f91c2e..d037adb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3311,7 +3311,7 @@ static int run(struct mddev *mddev)
 		 (conf->raid_disks / conf->near_copies));
 
 	rdev_for_each(rdev, mddev) {
-
+		struct request_queue *q;
 		disk_idx = rdev->raid_disk;
 		if (disk_idx >= conf->raid_disks
 		    || disk_idx < 0)
@@ -3327,6 +3327,9 @@ static int run(struct mddev *mddev)
 				goto out_free_conf;
 			disk->rdev = rdev;
 		}
+		q = bdev_get_queue(rdev->bdev);
+		if (q->merge_bvec_fn)
+			mddev->merge_check_needed = 1;
 
 		disk_stack_limits(mddev->gendisk, rdev->bdev,
 				  rdev->data_offset << 9);
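For reference, the check in raid0_make_request() that produces the
"make_request bug" message above looks essentially like this
(paraphrased from the 3.4 drivers/md/raid0.c, not verbatim):

	unsigned int chunk_sects = mddev->chunk_sectors;

	if (unlikely(chunk_sects <
		     (bio->bi_sector & (chunk_sects - 1)) +
		     (bio->bi_size >> 9))) {
		/* Only a one-page bio can still be split here; larger
		 * bios should have been kept inside one chunk by
		 * merge_bvec_fn - which never ran, because
		 * merge_check_needed was not set. */
		if (bio->bi_vcnt != 1 || bio->bi_idx != 0)
			goto bad_map;	/* prints the message above */
		...
	}

For the numbers reported: chunk_sects is 1024 (512 KiB), and
541312 & 1023 = 640, so that bio starts 320 KiB into a chunk and its
320 KiB of data run 128 KiB past the chunk boundary.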