OK, but still, during that time no reads are being completed.  I'm on:

    Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11 16:26:12 UTC 2011 x86_64 GNU/Linux

Do you know which kernel version has that commit?  2.6.35?

I think the root cause is that whenever dirty_background_bytes is reached,
the kernel flush thread [flush:254:0] wakes up and causes md_raid10_d0 to
go into state D, which makes everything hang for a while.  I guess maybe
the flush thread is calling fsync() after the write?  That's hard to
believe, but it would actually explain the symptom.

BTW, I don't think limiting batch writes to 1024 would solve the problem.
I'm actually doing that now, because I have to set dirty_background_bytes
to 4M, which works out to exactly 1024 writes every second or so.

Cheers.

On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@xxxxxxx> wrote:
> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
> wrote:
>
>> I'm not sure whether that is what I mean; to illustrate my problem, let
>> me put the output of "iostat -x -d 1" below.
>>
>> Device:  rrqm/s  wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sdb        0.00    0.00  163.00     1.00  1304.00      8.00     8.00     0.26    1.59   1.59  26.00
>> sdc        0.00    0.00   93.00     1.00   744.00      8.00     8.00     0.24    2.55   2.45  23.00
>> sde        0.00    0.00   56.00     1.00   448.00      8.00     8.00     0.22    3.86   3.86  22.00
>> sdd        0.00    0.00   88.00     1.00   704.00      8.00     8.00     0.18    2.02   2.02  18.00
>> md_d0      0.00    0.00  401.00     0.00  3208.00      0.00     8.00     0.00    0.00   0.00   0.00
>>
>> ==> This is normal operation: because of the page cache, only reads are
>> being submitted to the MD device.
>>
>> Device:  rrqm/s  wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sda        0.00    0.00    0.00     0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
>> sdb        0.00 1714.00    4.00   277.00    32.00  14810.00    52.82    34.04  105.05   2.92  82.00
>> sdc        0.00 1685.00   12.00   270.00    96.00  14122.00    50.42    42.56  131.03   3.09  87.00
>> sde        0.00 1385.00    8.00   261.00    64.00  12426.00    46.43    29.76   99.44   3.35  90.00
>> sdd        0.00 1350.00    8.00   228.00    64.00  10682.00    45.53    40.93  133.56   3.69  87.00
>> md_d0      0.00    0.00   32.00 16446.00   256.00 131568.00     8.00     0.00    0.00   0.00   0.00
>>
>> ==> A huge page flush has kicked in; note that read requests are
>> saturated on the MD device.
>>
>> Device:  rrqm/s  wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sda        0.00    0.00    0.00     0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
>> sdb        0.00 1542.00    4.00   264.00    32.00  11760.00    44.00    66.58  230.22   3.73 100.00
>> sdc        0.00 1185.00    0.00   272.00     0.00   9672.00    35.56    63.40  215.88   3.68 100.00
>> sde        0.00 1352.00    0.00   298.00     0.00  12488.00    41.91    35.56  126.34   3.36 100.00
>> sdd        0.00  996.00    0.00   294.00     0.00  10120.00    34.42    76.79  270.37   3.40 100.00
>> md_d0      0.00    0.00    4.00     0.00    32.00      0.00     8.00     0.00    0.00   0.00   0.00
>>
>> ==> The huge page flush is still running; no reads are being done.
>>
>> This is the problem: when the page flush kicks in, MD appears to refuse
>> incoming reads.  All the underlying devices use the deadline scheduler
>> and are tuned to favor reads, but that still doesn't help, since MD
>> simply doesn't submit new reads to the underlying devices.
>
> The counters are updated when a request completes, not when it is
> submitted, so you cannot tell from this data whether md is submitting the
> read requests or not.
>
> What kernel are you working with?  If it doesn't contain the commit
> identified below, can you try with that and see if it makes a difference?
>
> Thanks,
> NeilBrown
>
>
>> 2011/12/6 NeilBrown <neilb@xxxxxxx>:
>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>> > wrote:
>> >
>> >> The problem with using the page flush as a write cache here is that
>> >> writes to MD don't go through an IO scheduler, which is a very big
>> >> problem: when the flush thread decides to write to MD, it's impossible
>> >> to control the write speed or to prioritize reads over the writes.
>> >> Every request is basically FIFO, and when the flush size is big, no
>> >> reads can be served.
>> >>
>> >
>> > I'm not sure I understand....
>> >
>> > Requests don't go through an IO scheduler before they hit md, but they do
>> > after md sends them on down, so they can be re-ordered there.
>> >
>> > There was a bug where raid10 would allow an arbitrary number of writes to
>> > queue up, so that the flushing code didn't know when to stop.
>> >
>> > This was fixed by
>> > commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>> >
>> > nearly 2 months ago :-)
>> >
>> > NeilBrown
>> >
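
P.S. For reference, the tuning I'm describing looks roughly like the
sketch below.  The vm.dirty_background_bytes sysctl and the deadline
sysfs knobs are the standard ones, but the specific read_expire and
writes_starved values here are illustrative guesses, not necessarily what
vstore-1 is actually running, and the 1024-page arithmetic assumes the
usual 4 KiB page size.

    # Wake the flusher once ~4 MB of dirty data has accumulated
    # (4194304 bytes / 4096-byte pages = 1024 pages, hence the
    # ~1024-write batches mentioned above).
    sysctl -w vm.dirty_background_bytes=4194304

    # Use the deadline elevator on each member disk and bias it toward
    # reads: lower read_expire so reads age out sooner, and raise
    # writes_starved so more read batches run before a starved write is
    # forced through.
    for d in sdb sdc sdd sde; do
        echo deadline > /sys/block/$d/queue/scheduler
        echo 100 > /sys/block/$d/queue/iosched/read_expire
        echo 4   > /sys/block/$d/queue/iosched/writes_starved
    done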
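
P.P.S. Regarding my "which kernel version has that commit?" question
above: one quick way to check, from a clone of Linus's kernel tree, is to
ask git which release tags already contain the commit ID Neil quoted.

    # Nearest release tag that contains the raid10 write-queue fix:
    git describe --contains 34db0cd60f8a1f4ab73d118a8be3797c20388223

    # Or list every tag that already includes it:
    git tag --contains 34db0cd60f8a1f4ab73d118a8be3797c20388223

If no tag at or below the kernel you're running shows up, that kernel
doesn't have the fix.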