Sadly, the patch didn't help at all; see the following iostat samples:

Device:  rrqm/s   wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb        0.00  2042.00    0.00  345.00      0.00  64112.00   185.83    93.13   93.36   2.12  73.00
sdd        0.00  1704.00    7.00  156.00     56.00  12496.00    77.01    95.71  146.20   3.62  59.00
sdc        0.00  1518.00   16.00  185.00    128.00   9936.00    50.07    98.20  157.41   3.13  63.00
sde      222.00  1997.00  194.00  189.00  51568.00  16488.00   177.69    81.54   99.09   2.25  86.00
md0        0.00     0.00   37.00 4096.00    296.00  32768.00     8.00     0.00    0.00   0.00   0.00

Device:  rrqm/s   wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb        0.00   150.00    0.00  194.00      0.00  33336.00   171.84    34.91  492.84   4.59  89.00
sdd        0.00     0.00    0.00  138.00      0.00   3488.00    25.28    32.68  757.75   4.06  56.00
sdc        0.00     0.00    3.00  127.00     24.00   4704.00    36.37    33.68  771.08   4.54  59.00
sde      222.00     0.00   90.00   84.00  39936.00   1672.00   239.13    23.73  386.90   4.08  71.00
md0        0.00     0.00    2.00    0.00     16.00      0.00     8.00     0.00    0.00   0.00   0.00

Device:  rrqm/s   wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb        0.00   235.00    0.00  188.00      0.00  54024.00   287.36     0.49    3.78   1.65  31.00
sdd        0.00     0.00   27.00    0.00    216.00      0.00     8.00     0.15    5.56   5.56  15.00
sdc        0.00     0.00   46.00    0.00    368.00      0.00     8.00     0.32    6.52   6.96  32.00
sde      165.00     0.00  200.00    0.00  43480.00      0.00   217.40     7.63   38.15   2.00  40.00
md0        0.00     0.00  101.00    0.00    808.00      0.00     8.00     0.00    0.00   0.00   0.00

I poked around and found this when a big flush comes in:

Every 1.0s: cat /sys/block/sdb/stat /sys/block/sdc/stat /sys/block/sdd/stat /sys/block/sde/stat /sys/block/md0/stat    Wed Dec 7 22:26:14 2011

      32      10       336      270   2792623  5501730  783168880 254952160  284  4815060  255014270
 2993481 2222268 499586400 94384090    493165  1842192   18671608 271311440  290  9942910  365758660
  691727      19   5533896  1507300    501261  1838497   18706544 276987570  262  3254420  278552760
 1458797 1404948 281875858 49664210    483386  1841832   18588928 256627020  259  4997270  306348180
 2797538       0  22380058        0   4652939        0   37223512         0    0        0          0

Every downstream disk shows a huge jump in in-flight I/O (the 9th field; the stat lines above are sdb, sdc, sdd, sde and md0, in order), where it is usually just 0 or 1 the whole time. The kernel documentation says this figure does not include queued I/O, so I think the problem is that the I/O scheduler issued too many requests to the device without throttling reads against writes; that basically saturated the disks, so no other reads could be scheduled. Do you know why this would happen to me?

Here are my relevant scheduler tweaks:

for disk in /sys/block/sd[bcde]
do
        echo "changing $disk scheduler"
        echo "deadline" > $disk/queue/scheduler
        echo "changing $disk nr_requests to 4096"
        echo 4096 > $disk/queue/nr_requests
        echo "setting read_ahead_kb to 0"
        echo 0 > $disk/queue/read_ahead_kb
        echo "tweaking deadline io"
        echo 32 > $disk/queue/iosched/fifo_batch
        echo 30 > $disk/queue/iosched/read_expire
        echo 20000 > $disk/queue/iosched/write_expire
        echo 256 > $disk/queue/iosched/writes_starved
done
echo 0 > /sys/block/md0/queue/read_ahead_kb

My workload profile is 100% random 8K I/O. Come to think of it, this looks mostly like an I/O scheduling issue. Does nr_requests mean anything to MD? It's not possible to adjust it on md0 either; was that the reason MD can't accept more reads?

On Wed, Dec 7, 2011 at 4:10 PM, NeilBrown <neilb@xxxxxxx> wrote:
>
> On Wed, 7 Dec 2011 15:37:30 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
> wrote:
>
> > Neil, I can't compile latest MD against 2.6.32, and that commit can't
> > be patched into 2.6.32 directly either, can you help me on this?
> >
>
> This should do it.
>
> NeilBrown
>
> commit ef54b7cf955dc3b7d33248e8591b1a00b4fa998c
> Author: NeilBrown <neilb@xxxxxxx>
> Date:   Tue Oct 11 16:50:01 2011 +1100
>
> md: add proper write-congestion reporting to RAID1 and RAID10.
>
> RAID1 and RAID10 handle write requests by queuing them for handling by
> a separate thread.
> This is because when a write-intent-bitmap is
> active we might need to update the bitmap first, so it is good to
> queue a lot of writes, then do one big bitmap update for them all.
>
> However writeback requires devices to appear to be congested after a
> while so it can make some guesstimate of throughput.  The infinite
> queue defeats that (note that RAID5 already has a finite queue so
> it doesn't suffer from this problem).
>
> So impose a limit on the number of pending write requests.  By default
> it is 1024, which seems to be generally suitable.  Make it configurable
> via a module option just in case someone finds a regression.
>
> Signed-off-by: NeilBrown <neilb@xxxxxxx>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index e07ce2e..fe7ae3c 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -50,6 +50,11 @@
>   */
>  #define NR_RAID1_BIOS 256
>
> +/* When there are this many requests queued to be written by
> + * the raid1 thread, we become 'congested' to provide back-pressure
> + * for writeback.
> + */
> +static int max_queued_requests = 1024;
>
>  static void unplug_slaves(mddev_t *mddev);
>
> @@ -576,7 +581,8 @@ static int raid1_congested(void *data, int bits)
>          conf_t *conf = mddev->private;
>          int i, ret = 0;
>
> -        if (mddev_congested(mddev, bits))
> +        if (mddev_congested(mddev, bits) &&
> +            conf->pending_count >= max_queued_requests)
>                  return 1;
>
>          rcu_read_lock();
> @@ -613,10 +619,12 @@ static int flush_pending_writes(conf_t *conf)
>                  struct bio *bio;
>                  bio = bio_list_get(&conf->pending_bio_list);
>                  blk_remove_plug(conf->mddev->queue);
> +                conf->pending_count = 0;
>                  spin_unlock_irq(&conf->device_lock);
>                  /* flush any pending bitmap writes to
>                   * disk before proceeding w/ I/O */
>                  bitmap_unplug(conf->mddev->bitmap);
> +                wake_up(&conf->wait_barrier);
>
>                  while (bio) { /* submit pending writes */
>                          struct bio *next = bio->bi_next;
> @@ -789,6 +797,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>          int cpu;
>          bool do_barriers;
>          mdk_rdev_t *blocked_rdev;
> +        int cnt = 0;
>
>          /*
>           * Register the new request and wait if the reconstruction
> @@ -864,6 +873,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
>          /*
>           * WRITE:
>           */
> +        if (conf->pending_count >= max_queued_requests) {
> +                md_wakeup_thread(mddev->thread);
> +                wait_event(conf->wait_barrier,
> +                           conf->pending_count < max_queued_requests);
> +        }
>          /* first select target devices under spinlock and
>           * inc refcount on their rdev.
>           * Record them by setting bios[x] to bio
> @@ -970,6 +984,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>                  atomic_inc(&r1_bio->remaining);
>
>                  bio_list_add(&bl, mbio);
> +                cnt++;
>          }
>          kfree(behind_pages); /* the behind pages are attached to the bios now */
>
> @@ -978,6 +993,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>          spin_lock_irqsave(&conf->device_lock, flags);
>          bio_list_merge(&conf->pending_bio_list, &bl);
>          bio_list_init(&bl);
> +        conf->pending_count += cnt;
>
>          blk_plug_device(mddev->queue);
>          spin_unlock_irqrestore(&conf->device_lock, flags);
> @@ -2021,7 +2037,7 @@ static int run(mddev_t *mddev)
>
>          bio_list_init(&conf->pending_bio_list);
>          bio_list_init(&conf->flushing_bio_list);
> -
> +        conf->pending_count = 0;
>
>          mddev->degraded = 0;
>          for (i = 0; i < conf->raid_disks; i++) {
> @@ -2317,3 +2333,5 @@ MODULE_LICENSE("GPL");
>  MODULE_ALIAS("md-personality-3"); /* RAID1 */
>  MODULE_ALIAS("md-raid1");
>  MODULE_ALIAS("md-level-1");
> +
> +module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index e87b84d..520288c 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -38,6 +38,7 @@ struct r1_private_data_s {
>          /* queue of writes that have been unplugged */
>          struct bio_list flushing_bio_list;
>
> +        int pending_count;
>          /* for use when syncing mirrors: */
>
>          spinlock_t resync_lock;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index c2cb7b8..4c7d9b5 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -59,6 +59,11 @@ static void unplug_slaves(mddev_t *mddev);
>
>  static void allow_barrier(conf_t *conf);
>  static void lower_barrier(conf_t *conf);
> +/* When there are this many requests queued to be written by
> + * the raid10 thread, we become 'congested' to provide back-pressure
> + * for writeback.
> + */
> +static int max_queued_requests = 1024;
>
>  static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
>  {
> @@ -631,6 +636,10 @@ static int raid10_congested(void *data, int bits)
>          conf_t *conf = mddev->private;
>          int i, ret = 0;
>
> +        if ((bits & (1 << BDI_async_congested)) &&
> +            conf->pending_count >= max_queued_requests)
> +                return 1;
> +
>          if (mddev_congested(mddev, bits))
>                  return 1;
>          rcu_read_lock();
> @@ -660,10 +669,12 @@ static int flush_pending_writes(conf_t *conf)
>                  struct bio *bio;
>                  bio = bio_list_get(&conf->pending_bio_list);
>                  blk_remove_plug(conf->mddev->queue);
> +                conf->pending_count = 0;
>                  spin_unlock_irq(&conf->device_lock);
>                  /* flush any pending bitmap writes to disk
>                   * before proceeding w/ I/O */
>                  bitmap_unplug(conf->mddev->bitmap);
> +                wake_up(&conf->wait_barrier);
>
>                  while (bio) { /* submit pending writes */
>                          struct bio *next = bio->bi_next;
> @@ -802,6 +813,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>          struct bio_list bl;
>          unsigned long flags;
>          mdk_rdev_t *blocked_rdev;
> +        int cnt = 0;
>
>          if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
>                  bio_endio(bio, -EOPNOTSUPP);
> @@ -894,6 +906,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
>          /*
>           * WRITE:
>           */
> +        if (conf->pending_count >= max_queued_requests) {
> +                md_wakeup_thread(mddev->thread);
> +                wait_event(conf->wait_barrier,
> +                           conf->pending_count < max_queued_requests);
> +        }
>          /* first select target devices under rcu_lock and
>           * inc refcount on their rdev.
>           * Record them by setting bios[x] to bio
> @@ -957,6 +974,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>
>                  atomic_inc(&r10_bio->remaining);
>                  bio_list_add(&bl, mbio);
> +                cnt++;
>          }
>
>          if (unlikely(!atomic_read(&r10_bio->remaining))) {
> @@ -970,6 +988,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>          spin_lock_irqsave(&conf->device_lock, flags);
>          bio_list_merge(&conf->pending_bio_list, &bl);
>          blk_plug_device(mddev->queue);
> +        conf->pending_count += cnt;
>          spin_unlock_irqrestore(&conf->device_lock, flags);
>
>          /* In case raid10d snuck in to freeze_array */
> @@ -2318,3 +2337,5 @@ MODULE_LICENSE("GPL");
>  MODULE_ALIAS("md-personality-9"); /* RAID10 */
>  MODULE_ALIAS("md-raid10");
>  MODULE_ALIAS("md-level-10");
> +
> +module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
> diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
> index 59cd1ef..e6e1613 100644
> --- a/drivers/md/raid10.h
> +++ b/drivers/md/raid10.h
> @@ -39,7 +39,7 @@ struct r10_private_data_s {
>          struct list_head retry_list;
>          /* queue pending writes and submit them on unplug */
>          struct bio_list pending_bio_list;
> -
> +        int pending_count;
>
>          spinlock_t resync_lock;
>          int nr_pending;
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
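As an aside for anyone staring at the raw stat dump above: the in-flight figure being discussed is the 9th field of /sys/block/<dev>/stat (field layout per Documentation/block/stat.txt). A minimal sketch to pull those counters apart, using sdb's line from the watch output as the sample:

```python
# Field layout of /sys/block/<dev>/stat, per Documentation/block/stat.txt:
# read I/Os, read merges, read sectors, read ticks,
# write I/Os, write merges, write sectors, write ticks,
# in_flight, io_ticks, time_in_queue
FIELDS = [
    "read_ios", "read_merges", "read_sectors", "read_ticks",
    "write_ios", "write_merges", "write_sectors", "write_ticks",
    "in_flight", "io_ticks", "time_in_queue",
]

def parse_stat(line):
    """Turn one /sys/block/<dev>/stat line into a dict of named counters."""
    return dict(zip(FIELDS, (int(v) for v in line.split())))

# sdb's line from the watch output above: 284 requests in flight at the
# device level, even though this counter excludes I/O still sitting in
# the request queue.
sdb = parse_stat("32 10 336 270 2792623 5501730 783168880 254952160 "
                 "284 4815060 255014270")
print(sdb["in_flight"])  # -> 284
```

On a live box the same thing works against the sysfs file directly, e.g. parse_stat(open("/sys/block/sdb/stat").read()).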