So, I re-read the kernel code again. It looks like backing-dev.c is doing
the correct thing by calling writeback with WB_SYNC_NONE; it all looks
good, so I don't understand why it would appear read-starved on my system.
Still, I think your commit would definitely make things better. Ideally,
writeback would use only the bandwidth that is actually available, the way
synchronous I/O does, adjusting automatically.
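To make "automatically adjusting" concrete: from userspace you can
approximate that behaviour by starting writeback yourself in small,
bounded batches with sync_file_range(), so dirty pages never accumulate
up to dirty_background_bytes in the first place. A rough, untested sketch
(throttled_write() and CHUNK are names I made up for illustration, not
any existing API):

#define _GNU_SOURCE            /* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define CHUNK (1 << 20)        /* flush in 1 MiB batches; tune to taste */

/*
 * Write buf to fd in CHUNK-sized pieces, pushing each piece out to the
 * device as it is written and waiting for prior writeback of the range
 * first, so the write rate self-limits to what the device can absorb.
 */
static ssize_t throttled_write(int fd, const char *buf, size_t len, off_t off)
{
	size_t done = 0;

	while (done < len) {
		size_t n = len - done < CHUNK ? len - done : CHUNK;
		ssize_t w = pwrite(fd, buf + done, n, off + done);

		if (w < 0)
			return -1;
		/* wait for earlier writeback of this range, then start
		 * asynchronous writeback of the pages just dirtied
		 * (error handling omitted in this sketch) */
		sync_file_range(fd, off + done, w,
				SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
		done += w;
	}
	return (ssize_t)done;
}

That way the flusher never sees a megabyte-scale backlog aimed at md, and
reads keep getting a turn at the underlying queues.
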
2011/12/6 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>:
> Ok, still: during that time, no read is being completed.
>
> I'm on Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
> 16:26:12 UTC 2011 x86_64 GNU/Linux
> Do you know which kernel version has that commit? 2.6.35?
>
> I think the root cause is that whenever dirty_background_bytes is
> reached, the kernel flush thread [flush:254:0] wakes up and causes
> md_raid10_d0 to go into state D, which makes everything hang for a
> while. I guess maybe the flush thread is calling fsync() after the
> write? That's hard to believe, but it would actually explain the symptom.
>
> BTW, I don't think limiting batched writes to 1024 would solve the
> problem. I am actually doing that now, because I have to set
> dirty_background_bytes to 4M, which is exactly 1024 writes every second
> or so.
>
> Cheers.
>
> On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>> wrote:
>>
>>> I'm not sure whether that is what I mean; to illustrate my problem,
>>> let me put iostat -x -d 1 output below:
>>>
>>> Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>> sdb        0.00     0.00  163.00     1.00  1304.00       8.00      8.00      0.26    1.59   1.59  26.00
>>> sdc        0.00     0.00   93.00     1.00   744.00       8.00      8.00      0.24    2.55   2.45  23.00
>>> sde        0.00     0.00   56.00     1.00   448.00       8.00      8.00      0.22    3.86   3.86  22.00
>>> sdd        0.00     0.00   88.00     1.00   704.00       8.00      8.00      0.18    2.02   2.02  18.00
>>> md_d0      0.00     0.00  401.00     0.00  3208.00       0.00      8.00      0.00    0.00   0.00   0.00
>>>
>>> ==> This is normal operation; because of the page cache, only reads
>>> are being submitted to the MD device.
>>>
>>> Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>> sda        0.00     0.00    0.00     0.00     0.00       0.00      0.00      0.00    0.00   0.00   0.00
>>> sdb        0.00  1714.00    4.00   277.00    32.00   14810.00     52.82     34.04  105.05   2.92  82.00
>>> sdc        0.00  1685.00   12.00   270.00    96.00   14122.00     50.42     42.56  131.03   3.09  87.00
>>> sde        0.00  1385.00    8.00   261.00    64.00   12426.00     46.43     29.76   99.44   3.35  90.00
>>> sdd        0.00  1350.00    8.00   228.00    64.00   10682.00     45.53     40.93  133.56   3.69  87.00
>>> md_d0      0.00     0.00   32.00 16446.00   256.00  131568.00      8.00      0.00    0.00   0.00   0.00
>>>
>>> ==> A huge page flush kicks in; note that read requests are starved
>>> on the MD device.
>>>
>>> Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>> sda        0.00     0.00    0.00     0.00     0.00       0.00      0.00      0.00    0.00   0.00   0.00
>>> sdb        0.00  1542.00    4.00   264.00    32.00   11760.00     44.00     66.58  230.22   3.73 100.00
>>> sdc        0.00  1185.00    0.00   272.00     0.00    9672.00     35.56     63.40  215.88   3.68 100.00
>>> sde        0.00  1352.00    0.00   298.00     0.00   12488.00     41.91     35.56  126.34   3.36 100.00
>>> sdd        0.00   996.00    0.00   294.00     0.00   10120.00     34.42     76.79  270.37   3.40 100.00
>>> md_d0      0.00     0.00    4.00     0.00    32.00       0.00      8.00      0.00    0.00   0.00   0.00
>>>
>>> ==> The huge page flush is still running; no reads are being completed.
>>>
>>> This is the problem: when the page flush kicks in, MD appears to refuse
>>> incoming reads. Every underlying device uses the deadline scheduler,
>>> tuned to favor reads; still, it doesn't help, since MD simply doesn't
>>> submit new reads to the underlying devices.
>>
>> The counters are updated when a request completes, not when it is
>> submitted, so you cannot tell from this data whether md is submitting
>> the read requests or not.
>>
>> What kernel are you working with? If it doesn't contain the commit
>> identified below, can you try with that and see if it makes a difference?
>>
>> Thanks,
>> NeilBrown
>>
>>
>>>
>>> 2011/12/6 NeilBrown <neilb@xxxxxxx>:
>>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>>> > wrote:
>>> >
>>> >> The problem with using the page flush as a write cache here is that
>>> >> writes to MD don't go through an IO scheduler, which is a very big
>>> >> problem: when the flush thread decides to write to MD, there is no
>>> >> way to control the write speed or to prioritize reads over writes.
>>> >> Every request is basically handled FIFO, and when the flush is big,
>>> >> no reads can be served.
>>> >>
>>> >
>>> > I'm not sure I understand....
>>> >
>>> > Requests don't go through an IO scheduler before they hit md, but they do
>>> > after md sends them on down, so they can be re-ordered there.
>>> >
>>> > There was a bug where raid10 would allow an arbitrary number of writes to
>>> > queue up, so that the flushing code didn't know when to stop.
>>> >
>>> > This was fixed by
>>> > commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>>> >
>>> > nearly 2 months ago :-)
>>> >
>>> > NeilBrown
>>> >
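
P.S. The "4M is exactly 1024 writes" figure is just dirty_background_bytes
divided by the page size. A trivial check, in case anyone wants to verify
it on their own box (illustrative only; it prints 0 pages if
dirty_background_ratio is in use instead of dirty_background_bytes):

#include <stdio.h>
#include <unistd.h>

/*
 * Sanity-check the 4M == 1024-page equivalence: dirty_background_bytes
 * divided by the page size is the number of dirty pages that accumulate
 * before the flush thread wakes up.
 */
int main(void)
{
	FILE *f = fopen("/proc/sys/vm/dirty_background_bytes", "r");
	long bytes = 0;
	long page = sysconf(_SC_PAGESIZE);

	if (!f || fscanf(f, "%ld", &bytes) != 1) {
		perror("dirty_background_bytes");
		return 1;
	}
	fclose(f);
	printf("%ld bytes => %ld pages of %ld bytes per flush batch\n",
	       bytes, bytes / page, page);
	return 0;
}

With dirty_background_bytes at 4 MiB and 4 KiB pages, that is
4194304 / 4096 = 1024, matching the batch size above.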