So, I re-read the kernel code again. It looks like backing-dev.c is doing
the correct thing by calling writeback with WB_SYNC_NONE; it all looks
good, so I don't understand why it would appear read-starved on my system.
Still, I think your commit would definitely make things better. Ideally,
writeback would use only the bandwidth that is actually available, the way
synchronous I/O does, adjusting automatically.
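To make "automatically adjusting" concrete: from userspace you can
approximate that behaviour by starting writeback yourself in small,
bounded batches with sync_file_range(), so dirty pages never accumulate
up to dirty_background_bytes in the first place. A rough, untested sketch
(throttled_write() and CHUNK are names I made up for illustration, not
any existing API):

#define _GNU_SOURCE            /* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define CHUNK (1 << 20)        /* flush in 1 MiB batches; tune to taste */

/*
 * Write buf to fd in CHUNK-sized pieces, pushing each piece out to the
 * device as it is written and waiting for prior writeback of the range
 * first, so the write rate self-limits to what the device can absorb.
 */
static ssize_t throttled_write(int fd, const char *buf, size_t len, off_t off)
{
	size_t done = 0;

	while (done < len) {
		size_t n = len - done < CHUNK ? len - done : CHUNK;
		ssize_t w = pwrite(fd, buf + done, n, off + done);

		if (w < 0)
			return -1;
		/* wait for earlier writeback of this range, then start
		 * asynchronous writeback of the pages just dirtied
		 * (error handling omitted in this sketch) */
		sync_file_range(fd, off + done, w,
				SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
		done += w;
	}
	return (ssize_t)done;
}

That way the flusher never sees a megabyte-scale backlog aimed at md, and
reads keep getting a turn at the underlying queues.
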
2011/12/6 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>:
> Ok, still: during that time, no read is being completed.
>
> I'm on Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
> 16:26:12 UTC 2011 x86_64 GNU/Linux
> Do you know which kernel version has that commit? 2.6.35?
>
> I think the root cause is that whenever dirty_background_bytes is
> reached, the kernel flush thread [flush:254:0] wakes up and causes
> md_raid10_d0 to go into state D, which makes everything hang for a
> while. I guess maybe the flush thread is calling fsync() after the
> write? That's hard to believe, but it would actually explain the symptom.
>
> BTW, I don't think limiting batched writes to 1024 would solve the
> problem. I am actually doing that now, because I have to set
> dirty_background_bytes to 4M, which is exactly 1024 writes every second
> or so.
>
> Cheers.
>
> On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>> wrote:
>>
>>> I'm not sure whether that is what I mean; to illustrate my problem,
>>> let me put iostat -x -d 1 output below:
>>>
>>> Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>> sdb        0.00     0.00  163.00     1.00  1304.00       8.00      8.00      0.26    1.59   1.59  26.00
>>> sdc        0.00     0.00   93.00     1.00   744.00       8.00      8.00      0.24    2.55   2.45  23.00
>>> sde        0.00     0.00   56.00     1.00   448.00       8.00      8.00      0.22    3.86   3.86  22.00
>>> sdd        0.00     0.00   88.00     1.00   704.00       8.00      8.00      0.18    2.02   2.02  18.00
>>> md_d0      0.00     0.00  401.00     0.00  3208.00       0.00      8.00      0.00    0.00   0.00   0.00
>>>
>>> ==> This is normal operation; because of the page cache, only reads
>>> are being submitted to the MD device.
>>>
>>> Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>> sda        0.00     0.00    0.00     0.00     0.00       0.00      0.00      0.00    0.00   0.00   0.00
>>> sdb        0.00  1714.00    4.00   277.00    32.00   14810.00     52.82     34.04  105.05   2.92  82.00
>>> sdc        0.00  1685.00   12.00   270.00    96.00   14122.00     50.42     42.56  131.03   3.09  87.00
>>> sde        0.00  1385.00    8.00   261.00    64.00   12426.00     46.43     29.76   99.44   3.35  90.00
>>> sdd        0.00  1350.00    8.00   228.00    64.00   10682.00     45.53     40.93  133.56   3.69  87.00
>>> md_d0      0.00     0.00   32.00 16446.00   256.00  131568.00      8.00      0.00    0.00   0.00   0.00
>>>
>>> ==> A huge page flush kicks in; note that read requests are starved
>>> on the MD device.
>>>
>>> Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>> sda        0.00     0.00    0.00     0.00     0.00       0.00      0.00      0.00    0.00   0.00   0.00
>>> sdb        0.00  1542.00    4.00   264.00    32.00   11760.00     44.00     66.58  230.22   3.73 100.00
>>> sdc        0.00  1185.00    0.00   272.00     0.00    9672.00     35.56     63.40  215.88   3.68 100.00
>>> sde        0.00  1352.00    0.00   298.00     0.00   12488.00     41.91     35.56  126.34   3.36 100.00
>>> sdd        0.00   996.00    0.00   294.00     0.00   10120.00     34.42     76.79  270.37   3.40 100.00
>>> md_d0      0.00     0.00    4.00     0.00    32.00       0.00      8.00      0.00    0.00   0.00   0.00
>>>
>>> ==> The huge page flush is still running; no reads are being completed.
>>>
>>> This is the problem: when the page flush kicks in, MD appears to refuse
>>> incoming reads. Every underlying device uses the deadline scheduler,
>>> tuned to favor reads; still, it doesn't help, since MD simply doesn't
>>> submit new reads to the underlying devices.
>>
>> The counters are updated when a request completes, not when it is
>> submitted, so you cannot tell from this data whether md is submitting
>> the read requests or not.
>>
>> What kernel are you working with? If it doesn't contain the commit
>> identified below, can you try with that and see if it makes a difference?
>>
>> Thanks,
>> NeilBrown
>>
>>
>>>
>>> 2011/12/6 NeilBrown <neilb@xxxxxxx>:
>>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
>>> > wrote:
>>> >
>>> >> The problem with using the page flush as a write cache here is that
>>> >> writes to MD don't go through an IO scheduler, which is a very big
>>> >> problem: when the flush thread decides to write to MD, there is no
>>> >> way to control the write speed or to prioritize reads over writes.
>>> >> Every request is basically handled FIFO, and when the flush is big,
>>> >> no reads can be served.
>>> >>
>>> >
>>> > I'm not sure I understand....
>>> >
>>> > Requests don't go through an IO scheduler before they hit md, but they do
>>> > after md sends them on down, so they can be re-ordered there.
>>> >
>>> > There was a bug where raid10 would allow an arbitrary number of writes to
>>> > queue up, so that the flushing code didn't know when to stop.
>>> >
>>> > This was fixed by
>>> > commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>>> >
>>> > nearly 2 months ago :-)
>>> >
>>> > NeilBrown
>>> >
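
P.S. The "4M is exactly 1024 writes" figure is just dirty_background_bytes
divided by the page size. A trivial check, in case anyone wants to verify
it on their own box (illustrative only; it prints 0 pages if
dirty_background_ratio is in use instead of dirty_background_bytes):

#include <stdio.h>
#include <unistd.h>

/*
 * Sanity-check the 4M == 1024-page equivalence: dirty_background_bytes
 * divided by the page size is the number of dirty pages that accumulate
 * before the flush thread wakes up.
 */
int main(void)
{
	FILE *f = fopen("/proc/sys/vm/dirty_background_bytes", "r");
	long bytes = 0;
	long page = sysconf(_SC_PAGESIZE);

	if (!f || fscanf(f, "%ld", &bytes) != 1) {
		perror("dirty_background_bytes");
		return 1;
	}
	fclose(f);
	printf("%ld bytes => %ld pages of %ld bytes per flush batch\n",
	       bytes, bytes / page, page);
	return 0;
}

With dirty_background_bytes at 4 MiB and 4 KiB pages, that is
4194304 / 4096 = 1024, matching the batch size above.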