+DRBD group, which seems to have many Linux disk gurus

The problem: MD seems to refuse to submit read IO while the page flusher is submitting write IO.

I'm completely stumped. No matter how hard I tweak the deadline scheduler, it doesn't seem to make any big difference at all! The noop scheduler shows the same basic symptom.

I think the only logical explanation is that during the big write storm (generated by the page flush), MD is not submitting any read IO to the underlying devices, so the scheduler cannot prioritize reads and read performance is destroyed.

I can't be the only one being bitten by this.

If anyone wants to simulate what's happening, here's a good way (Ubuntu 10.04 LTS, kernel 2.6.32; I tested 2.6.35 and 2.6.38 and they have the same issue):

1. Set up the page cache:

   echo $((64*1024*1024)) > /proc/sys/vm/dirty_bytes
   echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes

2. Set up a RAID 10 with 4 disks, have something generate constant read IO, then have a dd generate constant write IO. The size doesn't matter, as long as some IO is going on.

3. Watch /proc/meminfo to see the dirty page count reach 16M, and watch iostat -x -d 1 when the flusher flushes the requests.

You can see a period of time during which NO read IO completes at all on any disk; the disks are all doing write IO. This defeats the whole point of using the page cache as a background write-back cache.

If anyone has a similar issue and knows how to deal with it, please tell me. Thanks in advance!
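To catch the moment described in step 3 without staring at /proc/meminfo by hand, a small helper can report the Dirty counter and compare it with the background threshold. This is only a sketch: the function names dirty_kb and dirty_over_background are made up for illustration, and the optional file argument exists only so the helpers can be tried on a saved copy of /proc/meminfo.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the original report).

# Print the amount of dirty page cache in kB.
# $1: meminfo-format file, defaults to /proc/meminfo
dirty_kb() {
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed when dirty memory has reached the given threshold.
# $1: threshold in bytes (e.g. the value written to dirty_background_bytes)
# $2: optional meminfo-format file
dirty_over_background() {
    threshold_kb=$(( $1 / 1024 ))
    [ "$(dirty_kb "$2")" -ge "$threshold_kb" ]
}
```

Run next to iostat -x -d 1, a loop such as `while sleep 1; do dirty_over_background $((16*1024*1024)) && echo "$(date +%T) dirty: $(dirty_kb) kB"; done` prints a timestamp each second that the 16M threshold is reached, which makes it easier to line up the flush with the iostat samples.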
On Wed, Dec 7, 2011 at 10:31 PM, Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx> wrote:
> Sadly, the patch didn't help at all; see the following:
>
> Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> sda          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> sdb          0.00  2042.00     0.00   345.00     0.00 64112.00   185.83    93.13    93.36     2.12    73.00
> sdd          0.00  1704.00     7.00   156.00    56.00 12496.00    77.01    95.71   146.20     3.62    59.00
> sdc          0.00  1518.00    16.00   185.00   128.00  9936.00    50.07    98.20   157.41     3.13    63.00
> sde        222.00  1997.00   194.00   189.00 51568.00 16488.00   177.69    81.54    99.09     2.25    86.00
> md0          0.00     0.00    37.00  4096.00   296.00 32768.00     8.00     0.00     0.00     0.00     0.00
>
> Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> sda          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> sdb          0.00   150.00     0.00   194.00     0.00 33336.00   171.84    34.91   492.84     4.59    89.00
> sdd          0.00     0.00     0.00   138.00     0.00  3488.00    25.28    32.68   757.75     4.06    56.00
> sdc          0.00     0.00     3.00   127.00    24.00  4704.00    36.37    33.68   771.08     4.54    59.00
> sde        222.00     0.00    90.00    84.00 39936.00  1672.00   239.13    23.73   386.90     4.08    71.00
> md0          0.00     0.00     2.00     0.00    16.00     0.00     8.00     0.00     0.00     0.00     0.00
>
> Device:    rrqm/s   wrqm/s      r/s      w/s   rsec/s   wsec/s avgrq-sz avgqu-sz    await    svctm    %util
> sda          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> sdb          0.00   235.00     0.00   188.00     0.00 54024.00   287.36     0.49     3.78     1.65    31.00
> sdd          0.00     0.00    27.00     0.00   216.00     0.00     8.00     0.15     5.56     5.56    15.00
> sdc          0.00     0.00    46.00     0.00   368.00     0.00     8.00     0.32     6.52     6.96    32.00
> sde        165.00     0.00   200.00     0.00 43480.00     0.00   217.40     7.63    38.15     2.00    40.00
> md0          0.00     0.00   101.00     0.00   808.00     0.00     8.00     0.00     0.00     0.00     0.00
>
> I poked around and found this when the big flush comes in:
>
> Every 1.0s: cat /sys/block/sdb/stat /sys/block/sdc/stat
> /sys/block/sdd/stat /sys/block/sde/stat /sys/block/md0/stat
> Wed Dec 7 22:26:14 2011
>
>      32      10       336      270 2792623 5501730 783168880 254952160 284 4815060 255014270
> 2993481 2222268 499586400 94384090  493165 1842192  18671608 271311440 290 9942910 365758660
>  691727      19   5533896  1507300  501261 1838497  18706544 276987570 262 3254420 278552760
> 1458797 1404948 281875858 49664210  483386 1841832  18588928 256627020 259 4997270 306348180
> 2797538       0  22380058        0 4652939       0  37223512         0   0       0         0
>
> Every downstream disk has a huge jump in in-flight IO (the 9th field),
> where it is usually just 0 or 1 the whole time. The kernel
> documentation says this doesn't include queued IO, so I think the
> problem is that the IO scheduler issued too many requests to the
> device without throttling reads/writes, which basically saturated the
> disk so that no other read can be scheduled. Do you know why this
> would happen to me?
>
> Here's my relevant scheduler tweak:
>
> for disk in /sys/block/sd[bcde]
> do
>     echo "changing $disk scheduler"
>     echo "deadline" > $disk/queue/scheduler
>
>     echo "changing $disk nr_requests to 4096"
>     echo 4096 > $disk/queue/nr_requests
>
>     echo "setra to 0"
>     echo 0 > $disk/queue/read_ahead_kb
>
>     echo "tweaking deadline io"
>     echo 32 > $disk/queue/iosched/fifo_batch
>     echo 30 > $disk/queue/iosched/read_expire
>     echo 20000 > $disk/queue/iosched/write_expire
>     echo 256 > $disk/queue/iosched/writes_starved
> done
>
> echo 0 > /sys/block/md0/queue/read_ahead_kb
>
> My workload profile is 100% random 8K IO.
>
> Come to think of it, the problem is mostly an IO scheduling issue.
> Does nr_requests mean anything to MD? It's not possible to adjust it
> either; was that the reason MD can't accept more reads?
>
> On Wed, Dec 7, 2011 at 4:10 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>
>> On Wed, 7 Dec 2011 15:37:30 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>> wrote:
>>
>> > Neil, I can't compile the latest MD against 2.6.32, and that commit can't
>> > be patched into 2.6.32 directly either. Can you help me with this?
>> >
>>
>> This should do it.
>>
>> NeilBrown
>>
>> commit ef54b7cf955dc3b7d33248e8591b1a00b4fa998c
>> Author: NeilBrown <neilb@xxxxxxx>
>> Date:   Tue Oct 11 16:50:01 2011 +1100
>>
>>     md: add proper write-congestion reporting to RAID1 and RAID10.
>>
>>     RAID1 and RAID10 handle write requests by queuing them for handling by
>>     a separate thread.  This is because when a write-intent-bitmap is
>>     active we might need to update the bitmap first, so it is good to
>>     queue a lot of writes, then do one big bitmap update for them all.
>>
>>     However writeback requires devices to appear to be congested after a
>>     while so it can make some guesstimate of throughput.  The infinite
>>     queue defeats that (note that RAID5 already has a finite queue so
>>     it doesn't suffer from this problem).
>>
>>     So impose a limit on the number of pending write requests.  By default
>>     it is 1024 which seems to be generally suitable.  Make it configurable
>>     via module option just in case someone finds a regression.
>>
>>     Signed-off-by: NeilBrown <neilb@xxxxxxx>
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index e07ce2e..fe7ae3c 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -50,6 +50,11 @@
>>   */
>>  #define NR_RAID1_BIOS 256
>>
>> +/* When there are this many requests queued to be written by
>> + * the raid1 thread, we become 'congested' to provide back-pressure
>> + * for writeback.
>> + */
>> +static int max_queued_requests = 1024;
>>
>>  static void unplug_slaves(mddev_t *mddev);
>>
>> @@ -576,7 +581,8 @@ static int raid1_congested(void *data, int bits)
>>  	conf_t *conf = mddev->private;
>>  	int i, ret = 0;
>>
>> -	if (mddev_congested(mddev, bits))
>> +	if (mddev_congested(mddev, bits) &&
>> +	    conf->pending_count >= max_queued_requests)
>>  		return 1;
>>
>>  	rcu_read_lock();
>> @@ -613,10 +619,12 @@ static int flush_pending_writes(conf_t *conf)
>>  		struct bio *bio;
>>  		bio = bio_list_get(&conf->pending_bio_list);
>>  		blk_remove_plug(conf->mddev->queue);
>> +		conf->pending_count = 0;
>>  		spin_unlock_irq(&conf->device_lock);
>>  		/* flush any pending bitmap writes to
>>  		 * disk before proceeding w/ I/O */
>>  		bitmap_unplug(conf->mddev->bitmap);
>> +		wake_up(&conf->wait_barrier);
>>
>>  		while (bio) { /* submit pending writes */
>>  			struct bio *next = bio->bi_next;
>> @@ -789,6 +797,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>  	int cpu;
>>  	bool do_barriers;
>>  	mdk_rdev_t *blocked_rdev;
>> +	int cnt = 0;
>>
>>  	/*
>>  	 * Register the new request and wait if the reconstruction
>> @@ -864,6 +873,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>  	/*
>>  	 * WRITE:
>>  	 */
>> +	if (conf->pending_count >= max_queued_requests) {
>> +		md_wakeup_thread(mddev->thread);
>> +		wait_event(conf->wait_barrier,
>> +			   conf->pending_count < max_queued_requests);
>> +	}
>>  	/* first select target devices under spinlock and
>>  	 * inc refcount on their rdev.  Record them by setting
>>  	 * bios[x] to bio
>> @@ -970,6 +984,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>  		atomic_inc(&r1_bio->remaining);
>>
>>  		bio_list_add(&bl, mbio);
>> +		cnt++;
>>  	}
>>  	kfree(behind_pages); /* the behind pages are attached to the bios now */
>>
>> @@ -978,6 +993,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>  	spin_lock_irqsave(&conf->device_lock, flags);
>>  	bio_list_merge(&conf->pending_bio_list, &bl);
>>  	bio_list_init(&bl);
>> +	conf->pending_count += cnt;
>>
>>  	blk_plug_device(mddev->queue);
>>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>> @@ -2021,7 +2037,7 @@ static int run(mddev_t *mddev)
>>
>>  	bio_list_init(&conf->pending_bio_list);
>>  	bio_list_init(&conf->flushing_bio_list);
>> -
>> +	conf->pending_count = 0;
>>
>>  	mddev->degraded = 0;
>>  	for (i = 0; i < conf->raid_disks; i++) {
>> @@ -2317,3 +2333,5 @@ MODULE_LICENSE("GPL");
>>  MODULE_ALIAS("md-personality-3"); /* RAID1 */
>>  MODULE_ALIAS("md-raid1");
>>  MODULE_ALIAS("md-level-1");
>> +
>> +module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
>> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
>> index e87b84d..520288c 100644
>> --- a/drivers/md/raid1.h
>> +++ b/drivers/md/raid1.h
>> @@ -38,6 +38,7 @@ struct r1_private_data_s {
>>  	/* queue of writes that have been unplugged */
>>  	struct bio_list		flushing_bio_list;
>>
>> +	int			pending_count;
>>  	/* for use when syncing mirrors: */
>>
>>  	spinlock_t		resync_lock;
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index c2cb7b8..4c7d9b5 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -59,6 +59,11 @@ static void unplug_slaves(mddev_t *mddev);
>>
>>  static void allow_barrier(conf_t *conf);
>>  static void lower_barrier(conf_t *conf);
>> +/* When there are this many requests queued to be written by
>> + * the raid10 thread, we become 'congested' to provide back-pressure
>> + * for writeback.
>> + */
>> +static int max_queued_requests = 1024;
>>
>>  static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
>>  {
>> @@ -631,6 +636,10 @@ static int raid10_congested(void *data, int bits)
>>  	conf_t *conf = mddev->private;
>>  	int i, ret = 0;
>>
>> +	if ((bits & (1 << BDI_async_congested)) &&
>> +	    conf->pending_count >= max_queued_requests)
>> +		return 1;
>> +
>>  	if (mddev_congested(mddev, bits))
>>  		return 1;
>>  	rcu_read_lock();
>> @@ -660,10 +669,12 @@ static int flush_pending_writes(conf_t *conf)
>>  		struct bio *bio;
>>  		bio = bio_list_get(&conf->pending_bio_list);
>>  		blk_remove_plug(conf->mddev->queue);
>> +		conf->pending_count = 0;
>>  		spin_unlock_irq(&conf->device_lock);
>>  		/* flush any pending bitmap writes to disk
>>  		 * before proceeding w/ I/O */
>>  		bitmap_unplug(conf->mddev->bitmap);
>> +		wake_up(&conf->wait_barrier);
>>
>>  		while (bio) { /* submit pending writes */
>>  			struct bio *next = bio->bi_next;
>> @@ -802,6 +813,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>  	struct bio_list bl;
>>  	unsigned long flags;
>>  	mdk_rdev_t *blocked_rdev;
>> +	int cnt = 0;
>>
>>  	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
>>  		bio_endio(bio, -EOPNOTSUPP);
>> @@ -894,6 +906,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>  	/*
>>  	 * WRITE:
>>  	 */
>> +	if (conf->pending_count >= max_queued_requests) {
>> +		md_wakeup_thread(mddev->thread);
>> +		wait_event(conf->wait_barrier,
>> +			   conf->pending_count < max_queued_requests);
>> +	}
>>  	/* first select target devices under rcu_lock and
>>  	 * inc refcount on their rdev.  Record them by setting
>>  	 * bios[x] to bio
>> @@ -957,6 +974,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>
>>  		atomic_inc(&r10_bio->remaining);
>>  		bio_list_add(&bl, mbio);
>> +		cnt++;
>>  	}
>>
>>  	if (unlikely(!atomic_read(&r10_bio->remaining))) {
>> @@ -970,6 +988,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>>  	spin_lock_irqsave(&conf->device_lock, flags);
>>  	bio_list_merge(&conf->pending_bio_list, &bl);
>>  	blk_plug_device(mddev->queue);
>> +	conf->pending_count += cnt;
>>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>>  	/* In case raid10d snuck in to freeze_array */
>> @@ -2318,3 +2337,5 @@ MODULE_LICENSE("GPL");
>>  MODULE_ALIAS("md-personality-9"); /* RAID10 */
>>  MODULE_ALIAS("md-raid10");
>>  MODULE_ALIAS("md-level-10");
>> +
>> +module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
>> diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
>> index 59cd1ef..e6e1613 100644
>> --- a/drivers/md/raid10.h
>> +++ b/drivers/md/raid10.h
>> @@ -39,7 +39,7 @@ struct r10_private_data_s {
>>  	struct list_head	retry_list;
>>  	/* queue pending writes and submit them on unplug */
>>  	struct bio_list		pending_bio_list;
>> -
>> +	int			pending_count;
>>
>>  	spinlock_t		resync_lock;
>>  	int			nr_pending;
>>
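Because the patch registers max_queued_requests with module_param() and S_IRUGO|S_IWUSR, the limit should show up as a file under /sys/module/<module>/parameters/ once the patched raid1 or raid10 module is loaded, readable by everyone and writable by root. The helpers below are a minimal sketch for inspecting and changing it while experimenting. The function names are made up for illustration, the optional sysfs-root argument exists only so they can be tried against a scratch directory, and nothing here implies that a particular value fixes the read starvation reported above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the patch); they assume a kernel
# with the patch above applied and the md personality loaded.

# Print the current limit on queued write requests.
# $1: module name (raid1 or raid10)
# $2: sysfs module root, defaults to /sys/module
md_queue_limit() {
    cat "${2:-/sys/module}/$1/parameters/max_queued_requests"
}

# Set a new limit; on a real system this needs root.
# $1: module name, $2: new limit, $3: optional sysfs module root
set_md_queue_limit() {
    echo "$2" > "${3:-/sys/module}/$1/parameters/max_queued_requests"
}
```

For example, `md_queue_limit raid10` should print 1024 with the patch's default, and `set_md_queue_limit raid10 256` would lower the point at which the array starts reporting congestion to writeback.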