Neil, I can't compile the latest MD against 2.6.32, and that commit can't be patched into 2.6.32 directly either. Can you help me with this? Cheers.

2011/12/7 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>:
> So, I re-read the kernel code again, and it looks like backing-dev.c is
> doing the correct thing by calling writeback with WB_SYNC_NONE. It all
> looks good, but I don't understand why my system would appear
> read-saturated.
>
> However, I think your commit would definitely make things better.
> Ideally, writes should only use the available bandwidth, like sync
> does, adjusting automatically.
>
> 2011/12/6 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>:
>> Ok. Still, during that time, no reads are being completed.
>>
>> I'm on Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
>> 16:26:12 UTC 2011 x86_64 GNU/Linux.
>> Do you know which kernel version has that commit? 2.6.35?
>>
>> I think the root cause is that whenever dirty_background_bytes is
>> reached, the kernel flush thread [flush:254:0] wakes up and causes
>> md_raid10_d0 to go into state D, which makes everything hang for a
>> while. I guess maybe the flush thread is calling fsync() after the
>> write? That's hard to believe, but it would actually explain the
>> symptom.
>>
>> BTW, I don't think limiting batched writes to 1024 would solve the
>> problem. I am actually doing that now, because I have to set
>> dirty_background_bytes to 4M, which is exactly 1024 writes every
>> second or so.
>>
>> Cheers.
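[The 4M-to-1024-writes arithmetic above checks out for 4 KiB pages. A minimal sketch of the tuning being described; the /proc/sys/vm path is the standard sysctl location, but the value itself is just the 4M figure from the message:]

```shell
# Illustrative only: set a 4 MiB background-dirty threshold (requires root,
# so shown as a comment):
#   echo 4194304 > /proc/sys/vm/dirty_background_bytes

# With 4 KiB pages, 4 MiB of dirty data is exactly 1024 pages, i.e. about
# 1024 page-sized writes per flush cycle:
echo $((4 * 1024 * 1024 / 4096))   # prints 1024
```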
>>
>> On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>>> wrote:
>>>
>>>> I'm not sure whether that is what I mean; to illustrate my problem,
>>>> let me put the iostat -x -d 1 output below.
>>>>
>>>> Device:  rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>> sdb        0.00     0.00  163.00      1.00  1304.00       8.00     8.00     0.26    1.59   1.59  26.00
>>>> sdc        0.00     0.00   93.00      1.00   744.00       8.00     8.00     0.24    2.55   2.45  23.00
>>>> sde        0.00     0.00   56.00      1.00   448.00       8.00     8.00     0.22    3.86   3.86  22.00
>>>> sdd        0.00     0.00   88.00      1.00   704.00       8.00     8.00     0.18    2.02   2.02  18.00
>>>> md_d0      0.00     0.00  401.00      0.00  3208.00       0.00     8.00     0.00    0.00   0.00   0.00
>>>>
>>>> ==> This is normal operation: because of the page cache, only reads
>>>> are being submitted to the MD device.
>>>>
>>>> Device:  rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>> sda        0.00     0.00    0.00      0.00     0.00       0.00     0.00     0.00    0.00   0.00   0.00
>>>> sdb        0.00  1714.00    4.00    277.00    32.00   14810.00    52.82    34.04  105.05   2.92  82.00
>>>> sdc        0.00  1685.00   12.00    270.00    96.00   14122.00    50.42    42.56  131.03   3.09  87.00
>>>> sde        0.00  1385.00    8.00    261.00    64.00   12426.00    46.43    29.76   99.44   3.35  90.00
>>>> sdd        0.00  1350.00    8.00    228.00    64.00   10682.00    45.53    40.93  133.56   3.69  87.00
>>>> md_d0      0.00     0.00   32.00  16446.00   256.00  131568.00     8.00     0.00    0.00   0.00   0.00
>>>>
>>>> ==> A huge page flush kicks in; note that read requests are starved
>>>> on the MD device.
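[The ==> annotations are read off the r/s and w/s columns of the md_d0 row. A minimal sketch of pulling those columns out of a captured iostat -x line with awk; the sample row is copied from the flush trace above, and the whitespace-only field splitting is an assumption about iostat's output format:]

```shell
# Sample md_d0 row captured from the iostat -x trace during the flush.
line="md_d0  0.00  0.00  32.00  16446.00  256.00  131568.00  8.00  0.00  0.00  0.00  0.00"

# Columns: device rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
reads_per_sec=$(echo "$line" | awk '{print $4}')
writes_per_sec=$(echo "$line" | awk '{print $5}')
echo "r/s=$reads_per_sec w/s=$writes_per_sec"   # prints r/s=32.00 w/s=16446.00
```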
>>>>
>>>> Device:  rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>> sda        0.00     0.00    0.00      0.00     0.00       0.00     0.00     0.00    0.00   0.00    0.00
>>>> sdb        0.00  1542.00    4.00    264.00    32.00   11760.00    44.00    66.58  230.22   3.73  100.00
>>>> sdc        0.00  1185.00    0.00    272.00     0.00    9672.00    35.56    63.40  215.88   3.68  100.00
>>>> sde        0.00  1352.00    0.00    298.00     0.00   12488.00    41.91    35.56  126.34   3.36  100.00
>>>> sdd        0.00   996.00    0.00    294.00     0.00   10120.00    34.42    76.79  270.37   3.40  100.00
>>>> md_d0      0.00     0.00    4.00      0.00    32.00       0.00     8.00     0.00    0.00   0.00    0.00
>>>>
>>>> ==> The huge page flush is still running; no reads are being completed.
>>>>
>>>> This is the problem: when the page flush kicks in, MD appears to
>>>> refuse incoming reads. All underlying devices use the deadline
>>>> scheduler and are tuned to favor reads; still, it doesn't help,
>>>> since MD simply doesn't submit new reads to the underlying devices.
>>>
>>> The counters are updated when a request completes, not when it is
>>> submitted, so you cannot tell from this data whether md is submitting
>>> the read requests or not.
>>>
>>> What kernel are you working with? If it doesn't contain the commit
>>> identified below, can you try with that and see if it makes a
>>> difference?
>>>
>>> Thanks,
>>> NeilBrown
>>>
>>>>
>>>> 2011/12/6 NeilBrown <neilb@xxxxxxx>:
>>>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>>>> > wrote:
>>>> >
>>>> >> The problem with using the page flush as a write cache here is
>>>> >> that writes to MD don't go through an IO scheduler, which is a
>>>> >> very big problem: when the flush thread decides to write to MD,
>>>> >> it's impossible to control the write speed or to prioritize reads
>>>> >> over writes. Every request is basically FIFO, and when the flush
>>>> >> size is big, no reads can be served.
>>>> >>
>>>> >
>>>> > I'm not sure I understand....
>>>> >
>>>> > Requests don't go through an IO scheduler before they hit md, but
>>>> > they do after md sends them on down, so they can be re-ordered
>>>> > there.
>>>> >
>>>> > There was a bug where raid10 would allow an arbitrary number of
>>>> > writes to queue up, so that the flushing code didn't know when to
>>>> > stop.
>>>> >
>>>> > This was fixed by
>>>> > commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>>>> >
>>>> > nearly 2 months ago :-)
>>>> >
>>>> > NeilBrown
>>>> >
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid"
>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
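[The commit Neil names clearly postdates 2.6.32. A sketch of how to check which release first contains it; the ~/linux path is an assumed local clone of the mainline kernel tree, so the git line is shown as a comment, while the sort -V comparison below runs anywhere:]

```shell
# Assumption: ~/linux is a clone of the mainline kernel git tree. The first
# tagged release containing the raid10 write-throttling fix is reported by:
#   git -C ~/linux describe --contains 34db0cd60f8a1f4ab73d118a8be3797c20388223

# sort -V orders kernel version strings correctly, confirming for example
# that both 2.6.32 and 2.6.35 predate any 3.x release:
printf '3.2\n2.6.32\n2.6.35\n' | sort -V
# prints:
#   2.6.32
#   2.6.35
#   3.2
```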