On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>> > >> -----Original Message-----
>>> > >> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>>> > >> Sent: 10 June 2015 14:06
>>> > >> To: Nick Fisk
>>> > >> Cc: ceph-users
>>> > >> Subject: Re: krbd splitting large IO's into smaller IO's
>>> > >>
>>> > >> On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>> > >> > Hi,
>>> > >> >
>>> > >> > Using Kernel RBD client with Kernel 4.03 (I have also tried some
>>> > >> > older kernels with the same effect) and IO is being split into
>>> > >> > smaller IO's which is having a negative impact on performance.
>>> > >> >
>>> > >> > cat /sys/block/sdc/queue/max_hw_sectors_kb
>>> > >> > 4096
>>> > >> >
>>> > >> > cat /sys/block/rbd0/queue/max_sectors_kb
>>> > >> > 4096
>>> > >> >
>>> > >> > Using DD
>>> > >> > dd if=/dev/rbd0 of=/dev/null bs=4M
>>> > >> >
>>> > >> > Device:  rrqm/s  wrqm/s     r/s   w/s      rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
>>> > >> > rbd0       0.00    0.00  201.50  0.00   25792.00   0.00    256.00      1.99  10.15    10.15     0.00   4.96  100.00
>>> > >> >
>>> > >> >
>>> > >> > Using FIO with 4M blocks
>>> > >> > Device:  rrqm/s  wrqm/s     r/s   w/s      rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
>>> > >> > rbd0       0.00    0.00  232.00  0.00  118784.00   0.00   1024.00     11.29  48.58    48.58     0.00   4.31  100.00
>>> > >> >
>>> > >> > Any ideas why IO sizes are limited to 128k (256 blocks) in DD's
>>> > >> > case and 512k in Fio's case?
>>> > >>
>>> > >> 128k vs 512k is probably buffered vs direct IO - add iflag=direct
>>> > >> to your dd invocation.
>>> > >
>>> > > Yes, thanks for this, that was the case
>>> > >
>>> > >>
>>> > >> As for the 512k - I'm pretty sure it's a regression in our switch
>>> > >> to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I
>>> > >> hope we are just missing a knob - I'll take a look.
>>> > >
>>> > > I've tested both 4.03 and 3.16 and both seem to be split into 512k.
>>> > > Let me know if you need me to test any other particular version.
>>> >
>>> > With 3.16 you are going to need to adjust max_hw_sectors_kb /
>>> > max_sectors_kb as discussed in Dan's thread. The patch that fixed
>>> > that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.
>>>
>>> Sorry should have mentioned, I had adjusted both of them on the 3.16
>>> kernel to 4096.
>>> I will try 3.19 and let you know.
>>
>> Better with 3.19, but should I not be seeing around 8192, or am I getting my
>> blocks and bytes mixed up?
>>
>> Device:  rrqm/s  wrqm/s     r/s   w/s      rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
>> rbd0      72.00    0.00   24.00  0.00   49152.00   0.00   4096.00      1.96  82.67    82.67     0.00  41.58   99.80
>
> I'd expect 8192. I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer. There is no
plugging going on in the single hardware queue (i.e. krbd) case - it
never once plugs the queue, and that means no request merging is done
for your direct sequential read test. It gets 512k bios and those same
512k requests are issued to krbd.

While queue plugging may not make sense in the multi queue case, I'm
pretty sure it's supposed to plug in the single queue case. Looks like
the use_plug logic in blk_sq_make_request() is busted.

Thanks,

                Ilya
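For anyone following the blocks-versus-bytes question in this thread, the arithmetic can be checked with a short script. This is only a sketch, not part of any Ceph or kernel tooling: it assumes iostat reports avgrq-sz in 512-byte sectors, the function names and sample labels are made up for illustration, and the figures are copied from the iostat output quoted above.

```python
# Sketch: convert iostat's avgrq-sz (assumed to be in 512-byte sectors)
# into KiB, and cross-check it against rkB/s divided by r/s.
# Function names and sample labels are illustrative only.

SECTOR_BYTES = 512


def avgrq_sz_to_kib(sectors):
    """Average request size in KiB, given avgrq-sz in sectors."""
    return sectors * SECTOR_BYTES / 1024


def kib_per_request(rkb_per_s, reads_per_s):
    """Average read size in KiB, derived from throughput and request rate."""
    return rkb_per_s / reads_per_s


def sectors_for_mib(mib):
    """The avgrq-sz value that would correspond to requests of `mib` MiB."""
    return mib * 1024 * 1024 // SECTOR_BYTES


# (avgrq-sz, rkB/s, r/s) taken from the iostat lines quoted in the thread.
samples = {
    "dd, bs=4M": (256.00, 25792.00, 201.50),
    "fio, 4M blocks": (1024.00, 118784.00, 232.00),
    "3.19 kernel": (4096.00, 49152.00, 24.00),
}

if __name__ == "__main__":
    for label, (avgrq_sz, rkb_s, r_s) in samples.items():
        from_sectors = avgrq_sz_to_kib(avgrq_sz)
        from_rates = kib_per_request(rkb_s, r_s)
        print(f"{label}: {from_sectors:.0f} KiB from avgrq-sz, "
              f"{from_rates:.0f} KiB from rkB/s / r/s")
    print(f"a full 4 MiB request would show avgrq-sz = {sectors_for_mib(4)}")
```

Under that assumption the three runs work out to 128 KiB, 512 KiB and 2048 KiB per request, and a full 4 MiB request would show up as avgrq-sz 8192, which matches the "I'd expect 8192" comment above.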