OK, but still, during that time no reads are being completed.  I'm on:

    Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11 16:26:12 UTC 2011 x86_64 GNU/Linux

Do you know which kernel version has that commit?  2.6.35?

I think the root cause is that whenever dirty_background_bytes is reached,
the kernel flush thread [flush:254:0] wakes up and causes md_raid10_d0 to
go into state D, which makes everything hang for a while.  I guess maybe
the flush thread is calling fsync() after the write?  That's hard to
believe, but it would actually explain the symptom.

BTW, I don't think limiting batch writes to 1024 would solve the problem.
I'm actually doing that now, because I have to set dirty_background_bytes
to 4M, which works out to exactly 1024 writes every second or so.

Cheers.

On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@xxxxxxx> wrote:
> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
> wrote:
>
>> I'm not sure whether that is what I mean; to illustrate my problem, let
>> me put the output of "iostat -x -d 1" below.
>>
>> Device:  rrqm/s  wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sdb        0.00    0.00  163.00     1.00  1304.00      8.00     8.00     0.26    1.59   1.59  26.00
>> sdc        0.00    0.00   93.00     1.00   744.00      8.00     8.00     0.24    2.55   2.45  23.00
>> sde        0.00    0.00   56.00     1.00   448.00      8.00     8.00     0.22    3.86   3.86  22.00
>> sdd        0.00    0.00   88.00     1.00   704.00      8.00     8.00     0.18    2.02   2.02  18.00
>> md_d0      0.00    0.00  401.00     0.00  3208.00      0.00     8.00     0.00    0.00   0.00   0.00
>>
>> ==> This is normal operation: because of the page cache, only reads are
>> being submitted to the MD device.
>>
>> Device:  rrqm/s  wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sda        0.00    0.00    0.00     0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
>> sdb        0.00 1714.00    4.00   277.00    32.00  14810.00    52.82    34.04  105.05   2.92  82.00
>> sdc        0.00 1685.00   12.00   270.00    96.00  14122.00    50.42    42.56  131.03   3.09  87.00
>> sde        0.00 1385.00    8.00   261.00    64.00  12426.00    46.43    29.76   99.44   3.35  90.00
>> sdd        0.00 1350.00    8.00   228.00    64.00  10682.00    45.53    40.93  133.56   3.69  87.00
>> md_d0      0.00    0.00   32.00 16446.00   256.00 131568.00     8.00     0.00    0.00   0.00   0.00
>>
>> ==> A huge page flush has kicked in; note that read requests are
>> saturated on the MD device.
>>
>> Device:  rrqm/s  wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sda        0.00    0.00    0.00     0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
>> sdb        0.00 1542.00    4.00   264.00    32.00  11760.00    44.00    66.58  230.22   3.73 100.00
>> sdc        0.00 1185.00    0.00   272.00     0.00   9672.00    35.56    63.40  215.88   3.68 100.00
>> sde        0.00 1352.00    0.00   298.00     0.00  12488.00    41.91    35.56  126.34   3.36 100.00
>> sdd        0.00  996.00    0.00   294.00     0.00  10120.00    34.42    76.79  270.37   3.40 100.00
>> md_d0      0.00    0.00    4.00     0.00    32.00      0.00     8.00     0.00    0.00   0.00   0.00
>>
>> ==> The huge page flush is still running; no reads are being done.
>>
>> This is the problem: when the page flush kicks in, MD appears to refuse
>> incoming reads.  All the underlying devices use the deadline scheduler
>> and are tuned to favor reads, but that still doesn't help, since MD
>> simply doesn't submit new reads to the underlying devices.
>
> The counters are updated when a request completes, not when it is
> submitted, so you cannot tell from this data whether md is submitting the
> read requests or not.
>
> What kernel are you working with?  If it doesn't contain the commit
> identified below, can you try with that and see if it makes a difference?
>
> Thanks,
> NeilBrown
>
>
>> 2011/12/6 NeilBrown <neilb@xxxxxxx>:
>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>> > wrote:
>> >
>> >> The problem with using the page flush as a write cache here is that
>> >> writes to MD don't go through an IO scheduler, which is a very big
>> >> problem: when the flush thread decides to write to MD, it's impossible
>> >> to control the write speed or to prioritize reads over the writes.
>> >> Every request is basically FIFO, and when the flush size is big, no
>> >> reads can be served.
>> >>
>> >
>> > I'm not sure I understand....
>> >
>> > Requests don't go through an IO scheduler before they hit md, but they do
>> > after md sends them on down, so they can be re-ordered there.
>> >
>> > There was a bug where raid10 would allow an arbitrary number of writes to
>> > queue up, so that the flushing code didn't know when to stop.
>> >
>> > This was fixed by
>> > commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>> >
>> > nearly 2 months ago :-)
>> >
>> > NeilBrown
>> >
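
P.S. For reference, the tuning I'm describing looks roughly like the
sketch below.  The vm.dirty_background_bytes sysctl and the deadline
sysfs knobs are the standard ones, but the specific read_expire and
writes_starved values here are illustrative guesses, not necessarily what
vstore-1 is actually running, and the 1024-page arithmetic assumes the
usual 4 KiB page size.

    # Wake the flusher once ~4 MB of dirty data has accumulated
    # (4194304 bytes / 4096-byte pages = 1024 pages, hence the
    # ~1024-write batches mentioned above).
    sysctl -w vm.dirty_background_bytes=4194304

    # Use the deadline elevator on each member disk and bias it toward
    # reads: lower read_expire so reads age out sooner, and raise
    # writes_starved so more read batches run before a starved write is
    # forced through.
    for d in sdb sdc sdd sde; do
        echo deadline > /sys/block/$d/queue/scheduler
        echo 100 > /sys/block/$d/queue/iosched/read_expire
        echo 4   > /sys/block/$d/queue/iosched/writes_starved
    done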
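
P.P.S. Regarding my "which kernel version has that commit?" question
above: one quick way to check, from a clone of Linus's kernel tree, is to
ask git which release tags already contain the commit ID Neil quoted.

    # Nearest release tag that contains the raid10 write-queue fix:
    git describe --contains 34db0cd60f8a1f4ab73d118a8be3797c20388223

    # Or list every tag that already includes it:
    git tag --contains 34db0cd60f8a1f4ab73d118a8be3797c20388223

If no tag at or below the kernel you're running shows up, that kernel
doesn't have the fix.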