Neil, I can't compile the latest MD against 2.6.32, and that commit can't be patched into 2.6.32 directly either. Can you help me with this? Cheers.

2011/12/7 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>:
> So, I re-read the kernel code again, and it looks like backing-dev.c is
> doing the correct thing by calling writeback with WB_SYNC_NONE. It all
> looks good, but I don't understand why my system would appear
> read-saturated.
>
> However, I think your commit would definitely make things better.
> Ideally, writes should only use the available bandwidth, like sync
> does, adjusting automatically.
>
> 2011/12/6 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>:
>> Ok. Still, during that time, no reads are being completed.
>>
>> I'm on Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
>> 16:26:12 UTC 2011 x86_64 GNU/Linux.
>> Do you know which kernel version has that commit? 2.6.35?
>>
>> I think the root cause is that whenever dirty_background_bytes is
>> reached, the kernel flush thread [flush:254:0] wakes up and causes
>> md_raid10_d0 to go into state D, which makes everything hang for a
>> while. I guess maybe the flush thread is calling fsync() after the
>> write? That's hard to believe, but it would actually explain the
>> symptom.
>>
>> BTW, I don't think limiting batched writes to 1024 would solve the
>> problem. I am actually doing that now, because I have to set
>> dirty_background_bytes to 4M, which is exactly 1024 writes every
>> second or so.
>>
>> Cheers.
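[The 4M-to-1024-writes arithmetic above checks out for 4 KiB pages. A minimal sketch of the tuning being described; the /proc/sys/vm path is the standard sysctl location, but the value itself is just the 4M figure from the message:]

```shell
# Illustrative only: set a 4 MiB background-dirty threshold (requires root,
# so shown as a comment):
#   echo 4194304 > /proc/sys/vm/dirty_background_bytes

# With 4 KiB pages, 4 MiB of dirty data is exactly 1024 pages, i.e. about
# 1024 page-sized writes per flush cycle:
echo $((4 * 1024 * 1024 / 4096))   # prints 1024
```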
>>
>> On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>>> wrote:
>>>
>>>> I'm not sure whether that is what I mean; to illustrate my problem,
>>>> let me put the iostat -x -d 1 output below.
>>>>
>>>> Device:  rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>> sdb        0.00     0.00  163.00      1.00  1304.00       8.00     8.00     0.26    1.59   1.59  26.00
>>>> sdc        0.00     0.00   93.00      1.00   744.00       8.00     8.00     0.24    2.55   2.45  23.00
>>>> sde        0.00     0.00   56.00      1.00   448.00       8.00     8.00     0.22    3.86   3.86  22.00
>>>> sdd        0.00     0.00   88.00      1.00   704.00       8.00     8.00     0.18    2.02   2.02  18.00
>>>> md_d0      0.00     0.00  401.00      0.00  3208.00       0.00     8.00     0.00    0.00   0.00   0.00
>>>>
>>>> ==> This is normal operation: because of the page cache, only reads
>>>> are being submitted to the MD device.
>>>>
>>>> Device:  rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>> sda        0.00     0.00    0.00      0.00     0.00       0.00     0.00     0.00    0.00   0.00   0.00
>>>> sdb        0.00  1714.00    4.00    277.00    32.00   14810.00    52.82    34.04  105.05   2.92  82.00
>>>> sdc        0.00  1685.00   12.00    270.00    96.00   14122.00    50.42    42.56  131.03   3.09  87.00
>>>> sde        0.00  1385.00    8.00    261.00    64.00   12426.00    46.43    29.76   99.44   3.35  90.00
>>>> sdd        0.00  1350.00    8.00    228.00    64.00   10682.00    45.53    40.93  133.56   3.69  87.00
>>>> md_d0      0.00     0.00   32.00  16446.00   256.00  131568.00     8.00     0.00    0.00   0.00   0.00
>>>>
>>>> ==> A huge page flush kicks in; note that read requests are starved
>>>> on the MD device.
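[The ==> annotations are read off the r/s and w/s columns of the md_d0 row. A minimal sketch of pulling those columns out of a captured iostat -x line with awk; the sample row is copied from the flush trace above, and the whitespace-only field splitting is an assumption about iostat's output format:]

```shell
# Sample md_d0 row captured from the iostat -x trace during the flush.
line="md_d0  0.00  0.00  32.00  16446.00  256.00  131568.00  8.00  0.00  0.00  0.00  0.00"

# Columns: device rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
reads_per_sec=$(echo "$line" | awk '{print $4}')
writes_per_sec=$(echo "$line" | awk '{print $5}')
echo "r/s=$reads_per_sec w/s=$writes_per_sec"   # prints r/s=32.00 w/s=16446.00
```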
>>>>
>>>> Device:  rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>>> sda        0.00     0.00    0.00      0.00     0.00       0.00     0.00     0.00    0.00   0.00    0.00
>>>> sdb        0.00  1542.00    4.00    264.00    32.00   11760.00    44.00    66.58  230.22   3.73  100.00
>>>> sdc        0.00  1185.00    0.00    272.00     0.00    9672.00    35.56    63.40  215.88   3.68  100.00
>>>> sde        0.00  1352.00    0.00    298.00     0.00   12488.00    41.91    35.56  126.34   3.36  100.00
>>>> sdd        0.00   996.00    0.00    294.00     0.00   10120.00    34.42    76.79  270.37   3.40  100.00
>>>> md_d0      0.00     0.00    4.00      0.00    32.00       0.00     8.00     0.00    0.00   0.00    0.00
>>>>
>>>> ==> The huge page flush is still running; no reads are being completed.
>>>>
>>>> This is the problem: when the page flush kicks in, MD appears to
>>>> refuse incoming reads. All underlying devices use the deadline
>>>> scheduler and are tuned to favor reads; still, it doesn't help,
>>>> since MD simply doesn't submit new reads to the underlying devices.
>>>
>>> The counters are updated when a request completes, not when it is
>>> submitted, so you cannot tell from this data whether md is submitting
>>> the read requests or not.
>>>
>>> What kernel are you working with? If it doesn't contain the commit
>>> identified below, can you try with that and see if it makes a
>>> difference?
>>>
>>> Thanks,
>>> NeilBrown
>>>
>>>>
>>>> 2011/12/6 NeilBrown <neilb@xxxxxxx>:
>>>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>>>> > wrote:
>>>> >
>>>> >> The problem with using the page flush as a write cache here is
>>>> >> that writes to MD don't go through an IO scheduler, which is a
>>>> >> very big problem: when the flush thread decides to write to MD,
>>>> >> it's impossible to control the write speed or to prioritize reads
>>>> >> over writes. Every request is basically FIFO, and when the flush
>>>> >> size is big, no reads can be served.
>>>> >>
>>>> >
>>>> > I'm not sure I understand....
>>>> >
>>>> > Requests don't go through an IO scheduler before they hit md, but
>>>> > they do after md sends them on down, so they can be re-ordered
>>>> > there.
>>>> >
>>>> > There was a bug where raid10 would allow an arbitrary number of
>>>> > writes to queue up, so that the flushing code didn't know when to
>>>> > stop.
>>>> >
>>>> > This was fixed by
>>>> > commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>>>> >
>>>> > nearly 2 months ago :-)
>>>> >
>>>> > NeilBrown
>>>> >
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid"
>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
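[The commit Neil names clearly postdates 2.6.32. A sketch of how to check which release first contains it; the ~/linux path is an assumed local clone of the mainline kernel tree, so the git line is shown as a comment, while the sort -V comparison below runs anywhere:]

```shell
# Assumption: ~/linux is a clone of the mainline kernel git tree. The first
# tagged release containing the raid10 write-throttling fix is reported by:
#   git -C ~/linux describe --contains 34db0cd60f8a1f4ab73d118a8be3797c20388223

# sort -V orders kernel version strings correctly, confirming for example
# that both 2.6.32 and 2.6.35 predate any 3.x release:
printf '3.2\n2.6.32\n2.6.35\n' | sort -V
# prints:
#   2.6.32
#   2.6.35
#   3.2
```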