I'm not sure whether that is what I mean; to illustrate my problem, here is the output of iostat -x -d 1:

Device:  rrqm/s   wrqm/s      r/s       w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
sdb        0.00     0.00   163.00      1.00  1304.00       8.00      8.00      0.26    1.59   1.59   26.00
sdc        0.00     0.00    93.00      1.00   744.00       8.00      8.00      0.24    2.55   2.45   23.00
sde        0.00     0.00    56.00      1.00   448.00       8.00      8.00      0.22    3.86   3.86   22.00
sdd        0.00     0.00    88.00      1.00   704.00       8.00      8.00      0.18    2.02   2.02   18.00
md_d0      0.00     0.00   401.00      0.00  3208.00       0.00      8.00      0.00    0.00   0.00    0.00

==> Normal operation: because of the page cache, only reads are being submitted to the MD device.

Device:  rrqm/s   wrqm/s      r/s       w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
sda        0.00     0.00     0.00      0.00     0.00       0.00      0.00      0.00    0.00   0.00    0.00
sdb        0.00  1714.00     4.00    277.00    32.00   14810.00     52.82     34.04  105.05   2.92   82.00
sdc        0.00  1685.00    12.00    270.00    96.00   14122.00     50.42     42.56  131.03   3.09   87.00
sde        0.00  1385.00     8.00    261.00    64.00   12426.00     46.43     29.76   99.44   3.35   90.00
sdd        0.00  1350.00     8.00    228.00    64.00   10682.00     45.53     40.93  133.56   3.69   87.00
md_d0      0.00     0.00    32.00  16446.00   256.00  131568.00      8.00      0.00    0.00   0.00    0.00

==> A huge page flush kicks in; note that the MD device is now swamped with write requests and the reads drop right off.

Device:  rrqm/s   wrqm/s      r/s       w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
sda        0.00     0.00     0.00      0.00     0.00       0.00      0.00      0.00    0.00   0.00    0.00
sdb        0.00  1542.00     4.00    264.00    32.00   11760.00     44.00     66.58  230.22   3.73  100.00
sdc        0.00  1185.00     0.00    272.00     0.00    9672.00     35.56     63.40  215.88   3.68  100.00
sde        0.00  1352.00     0.00    298.00     0.00   12488.00     41.91     35.56  126.34   3.36  100.00
sdd        0.00   996.00     0.00    294.00     0.00   10120.00     34.42     76.79  270.37   3.40  100.00
md_d0      0.00     0.00     4.00      0.00    32.00       0.00      8.00      0.00    0.00   0.00    0.00

==> The page flush is still running and almost no reads are being done.

This is the problem: when the page flush kicks in, MD appears to refuse incoming reads. All of the underlying devices are on the deadline scheduler and tuned to favour reads, but that does not help, because MD simply does not submit new reads to the underlying devices.

2011/12/6 NeilBrown <neilb@xxxxxxx>:
> On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@xxxxxxxxx>
> wrote:
>
>> The problem with using the page flush as a write cache here is that
>> writes to MD don't go through an IO scheduler, which is a very big
>> problem: when the flush thread decides to write to MD, it is impossible
>> to control the write speed or to prioritize the writes against reads.
>> Every request is basically FIFO, and when the flush size is big, no
>> reads can be served.
>>
> I'm not sure I understand....
>
> Requests don't go through an IO scheduler before they hit md, but they do
> after md sends them on down, so they can be re-ordered there.
>
> There was a bug where raid10 would allow an arbitrary number of writes to
> queue up so that the flushing code didn't know when to stop.
>
> This was fixed by
> commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>
> nearly 2 months ago :-)
>
> NeilBrown
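
For reference, here is a rough sketch of the knobs involved: the deadline tuning mentioned above, plus the writeback limits that bound how large a flush burst can get. The values are purely illustrative (not the exact settings on this box), and the device names just follow the iostat output:

# Shrink the dirty page cache limits so writeback starts earlier and
# each flush burst hitting the MD device is smaller.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# Put the member disks on the deadline scheduler and bias it towards
# reads: a short read deadline, a long write deadline, and more read
# batches allowed before a starved write is forced through.
for dev in sdb sdc sdd sde; do
    echo deadline > /sys/block/$dev/queue/scheduler
    echo 50       > /sys/block/$dev/queue/iosched/read_expire     # ms
    echo 5000     > /sys/block/$dev/queue/iosched/write_expire    # ms
    echo 4        > /sys/block/$dev/queue/iosched/writes_starved
done

As noted in the thread, this only affects the per-disk queues below MD; it does not change the order in which MD itself accepts requests.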