On Thu, Jun 11, 2015 at 5:30 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Ilya Dryomov
>> Sent: 11 June 2015 12:33
>> To: Nick Fisk
>> Cc: ceph-users
>> Subject: Re: krbd splitting large IO's into smaller IO's
>>
>> On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> > On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> >> On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>>> > >> -----Original Message-----
>> >>>> > >> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> >>>> > >> Sent: 10 June 2015 14:06
>> >>>> > >> To: Nick Fisk
>> >>>> > >> Cc: ceph-users
>> >>>> > >> Subject: Re: krbd splitting large IO's into smaller IO's
>> >>>> > >>
>> >>>> > >> On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>>> > >> > Hi,
>> >>>> > >> >
>> >>>> > >> > Using the kernel RBD client with kernel 4.0.3 (I have also
>> >>>> > >> > tried some older kernels with the same effect), IO is being
>> >>>> > >> > split into smaller IOs, which is having a negative impact on
>> >>>> > >> > performance.
>> >>>> > >> >
>> >>>> > >> > cat /sys/block/sdc/queue/max_hw_sectors_kb
>> >>>> > >> > 4096
>> >>>> > >> >
>> >>>> > >> > cat /sys/block/rbd0/queue/max_sectors_kb
>> >>>> > >> > 4096
>> >>>> > >> >
>> >>>> > >> > Using dd:
>> >>>> > >> > dd if=/dev/rbd0 of=/dev/null bs=4M
>> >>>> > >> >
>> >>>> > >> > Device: rrqm/s wrqm/s    r/s  w/s     rkB/s wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
>> >>>> > >> > rbd0      0.00   0.00 201.50 0.00  25792.00  0.00   256.00     1.99  10.15   10.15    0.00   4.96 100.00
>> >>>> > >> >
>> >>>> > >> > Using fio with 4M blocks:
>> >>>> > >> > Device: rrqm/s wrqm/s    r/s  w/s     rkB/s wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
>> >>>> > >> > rbd0      0.00   0.00 232.00 0.00 118784.00  0.00  1024.00    11.29  48.58   48.58    0.00   4.31 100.00
>> >>>> > >> >
>> >>>> > >> > Any ideas why IO sizes are limited to 128k (256 sectors) in
>> >>>> > >> > dd's case and 512k in fio's case?
>> >>>> > >>
>> >>>> > >> 128k vs 512k is probably buffered vs direct IO - add
>> >>>> > >> iflag=direct to your dd invocation.
>> >>>> > >
>> >>>> > > Yes, thanks for this, that was the case.
>> >>>> > >
>> >>>> > >> As for the 512k - I'm pretty sure it's a regression in our
>> >>>> > >> switch to blk-mq. I tested it around 3.18-3.19 and saw steady
>> >>>> > >> 4M IOs. I hope we are just missing a knob - I'll take a look.
>> >>>> > >
>> >>>> > > I've tested both 4.0.3 and 3.16 and both seem to split into 512k.
>> >>>> > > Let me know if you need me to test any other particular version.
>> >>>> >
>> >>>> > With 3.16 you are going to need to adjust max_hw_sectors_kb /
>> >>>> > max_sectors_kb as discussed in Dan's thread. The patch that fixed
>> >>>> > that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.
>> >>>>
>> >>>> Sorry, I should have mentioned: I had adjusted both of them to 4096
>> >>>> on the 3.16 kernel. I will try 3.19 and let you know.
>> >>>
>> >>> Better with 3.19, but should I not be seeing around 8192, or am I
>> >>> getting my blocks and bytes mixed up?
>> >>>
>> >>> Device: rrqm/s wrqm/s   r/s  w/s    rkB/s wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
>> >>> rbd0     72.00   0.00 24.00 0.00 49152.00  0.00  4096.00     1.96  82.67   82.67    0.00 41.58 99.80
>> >>
>> >> I'd expect 8192. I'm getting a box for investigation.
>> >
>> > OK, so this is a bug in the blk-mq part of the block layer. There is
>> > no plugging going on in the single hardware queue (i.e. krbd) case -
>> > it never once plugs the queue, and that means no request merging is
>> > done for your direct sequential read test. It gets 512k bios and
>> > those same 512k requests are issued to krbd.
>> >
>> > While queue plugging may not make sense in the multi queue case, I'm
>> > pretty sure it's supposed to plug in the single queue case. Looks
>> > like the use_plug logic in blk_sq_make_request() is busted.
>>
>> It turns out to be a year-old regression. Before commit 07068d5b8ed8
>> ("blk-mq: split make request handler for multi and single queue") it
>> used to be (reads are considered sync)
>>
>>     use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
>>
>> and now it is
>>
>>     use_plug = !is_flush_fua && !is_sync;
>>
>> in a function that is only called if q->nr_hw_queues == 1.
>>
>> This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request"
>> from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750.
>> Looks like it's on its way to mainline along with some other blk-mq
>> plugging fixes.
>
> That's great, do you think it will make 4.2?

Depends on Jens, but I think it will.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
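A minimal sketch of the repro steps discussed in the thread, assuming the
image is mapped at /dev/rbd0 (the device name and values are illustrative;
adjust to your setup, and run the sysfs write as root):

    # Check the request-size ceilings discussed above (values in KiB).
    cat /sys/block/rbd0/queue/max_hw_sectors_kb
    cat /sys/block/rbd0/queue/max_sectors_kb

    # On pre-3.19 kernels, raise max_sectors_kb by hand as discussed:
    echo 4096 > /sys/block/rbd0/queue/max_sectors_kb

    # Sequential read with direct IO - buffered reads go through the
    # page cache and show up as 128k requests regardless of bs.
    dd if=/dev/rbd0 of=/dev/null bs=4M iflag=direct &

    # Watch avgrq-sz, which iostat reports in 512-byte sectors:
    # 256 = 128k, 1024 = 512k, 8192 = the full 4M.
    iostat -x rbd0 1

A non-zero rrqm/s column, as in the 3.19 output above, is the sign that
adjacent reads are being merged back into larger requests.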
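To check whether a given kernel tree carries Jeff Moyer's fix, or the
commit that introduced the regression, grepping the git log for the
subject lines quoted above should work (assuming a mainline checkout; no
hash for the fix itself is given in the thread):

    # In a Linux kernel source checkout:
    git log --oneline --grep='blk-mq: fix plugging in blk_sq_make_request'

    # The commit that introduced the regression, per the thread:
    git show --stat 07068d5b8ed8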