On Thu, Jun 11, 2015 at 5:30 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Ilya Dryomov
>> Sent: 11 June 2015 12:33
>> To: Nick Fisk
>> Cc: ceph-users
>> Subject: Re: krbd splitting large IO's into smaller IO's
>>
>> On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> > On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> >> On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>>> > >> -----Original Message-----
>> >>>> > >> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> >>>> > >> Sent: 10 June 2015 14:06
>> >>>> > >> To: Nick Fisk
>> >>>> > >> Cc: ceph-users
>> >>>> > >> Subject: Re: krbd splitting large IO's into smaller IO's
>> >>>> > >>
>> >>>> > >> On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>>> > >> > Hi,
>> >>>> > >> >
>> >>>> > >> > Using the kernel RBD client with kernel 4.0.3 (I have also
>> >>>> > >> > tried some older kernels with the same effect), IO is being
>> >>>> > >> > split into smaller IOs, which is having a negative impact on
>> >>>> > >> > performance.
>> >>>> > >> >
>> >>>> > >> > cat /sys/block/sdc/queue/max_hw_sectors_kb
>> >>>> > >> > 4096
>> >>>> > >> >
>> >>>> > >> > cat /sys/block/rbd0/queue/max_sectors_kb
>> >>>> > >> > 4096
>> >>>> > >> >
>> >>>> > >> > Using dd:
>> >>>> > >> > dd if=/dev/rbd0 of=/dev/null bs=4M
>> >>>> > >> >
>> >>>> > >> > Device: rrqm/s wrqm/s    r/s  w/s     rkB/s wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
>> >>>> > >> > rbd0      0.00   0.00 201.50 0.00  25792.00  0.00   256.00     1.99  10.15   10.15    0.00   4.96 100.00
>> >>>> > >> >
>> >>>> > >> > Using fio with 4M blocks:
>> >>>> > >> > Device: rrqm/s wrqm/s    r/s  w/s     rkB/s wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
>> >>>> > >> > rbd0      0.00   0.00 232.00 0.00 118784.00  0.00  1024.00    11.29  48.58   48.58    0.00   4.31 100.00
>> >>>> > >> >
>> >>>> > >> > Any ideas why IO sizes are limited to 128k (256 sectors) in
>> >>>> > >> > dd's case and 512k in fio's case?
>> >>>> > >>
>> >>>> > >> 128k vs 512k is probably buffered vs direct IO - add
>> >>>> > >> iflag=direct to your dd invocation.
>> >>>> > >
>> >>>> > > Yes, thanks for this, that was the case.
>> >>>> > >
>> >>>> > >> As for the 512k - I'm pretty sure it's a regression in our
>> >>>> > >> switch to blk-mq. I tested it around 3.18-3.19 and saw steady
>> >>>> > >> 4M IOs. I hope we are just missing a knob - I'll take a look.
>> >>>> > >
>> >>>> > > I've tested both 4.0.3 and 3.16 and both seem to split into 512k.
>> >>>> > > Let me know if you need me to test any other particular version.
>> >>>> >
>> >>>> > With 3.16 you are going to need to adjust max_hw_sectors_kb /
>> >>>> > max_sectors_kb as discussed in Dan's thread. The patch that fixed
>> >>>> > that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.
>> >>>>
>> >>>> Sorry, I should have mentioned: I had adjusted both of them to 4096
>> >>>> on the 3.16 kernel. I will try 3.19 and let you know.
>> >>>
>> >>> Better with 3.19, but should I not be seeing around 8192, or am I
>> >>> getting my blocks and bytes mixed up?
>> >>>
>> >>> Device: rrqm/s wrqm/s   r/s  w/s    rkB/s wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
>> >>> rbd0     72.00   0.00 24.00 0.00 49152.00  0.00  4096.00     1.96  82.67   82.67    0.00 41.58 99.80
>> >>
>> >> I'd expect 8192. I'm getting a box for investigation.
>> >
>> > OK, so this is a bug in the blk-mq part of the block layer. There is
>> > no plugging going on in the single hardware queue (i.e. krbd) case -
>> > it never once plugs the queue, and that means no request merging is
>> > done for your direct sequential read test. It gets 512k bios and
>> > those same 512k requests are issued to krbd.
>> >
>> > While queue plugging may not make sense in the multi queue case, I'm
>> > pretty sure it's supposed to plug in the single queue case. Looks
>> > like the use_plug logic in blk_sq_make_request() is busted.
>>
>> It turns out to be a year-old regression. Before commit 07068d5b8ed8
>> ("blk-mq: split make request handler for multi and single queue") it
>> used to be (reads are considered sync)
>>
>>     use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
>>
>> and now it is
>>
>>     use_plug = !is_flush_fua && !is_sync;
>>
>> in a function that is only called if q->nr_hw_queues == 1.
>>
>> This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request"
>> from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750.
>> Looks like it's on its way to mainline along with some other blk-mq
>> plugging fixes.
>
> That's great, do you think it will make 4.2?

Depends on Jens, but I think it will.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
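A minimal sketch of the repro steps discussed in the thread, assuming the
image is mapped at /dev/rbd0 (the device name and values are illustrative;
adjust to your setup, and run the sysfs write as root):

    # Check the request-size ceilings discussed above (values in KiB).
    cat /sys/block/rbd0/queue/max_hw_sectors_kb
    cat /sys/block/rbd0/queue/max_sectors_kb

    # On pre-3.19 kernels, raise max_sectors_kb by hand as discussed:
    echo 4096 > /sys/block/rbd0/queue/max_sectors_kb

    # Sequential read with direct IO - buffered reads go through the
    # page cache and show up as 128k requests regardless of bs.
    dd if=/dev/rbd0 of=/dev/null bs=4M iflag=direct &

    # Watch avgrq-sz, which iostat reports in 512-byte sectors:
    # 256 = 128k, 1024 = 512k, 8192 = the full 4M.
    iostat -x rbd0 1

A non-zero rrqm/s column, as in the 3.19 output above, is the sign that
adjacent reads are being merged back into larger requests.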
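To check whether a given kernel tree carries Jeff Moyer's fix, or the
commit that introduced the regression, grepping the git log for the
subject lines quoted above should work (assuming a mainline checkout; no
hash for the fix itself is given in the thread):

    # In a Linux kernel source checkout:
    git log --oneline --grep='blk-mq: fix plugging in blk_sq_make_request'

    # The commit that introduced the regression, per the thread:
    git show --stat 07068d5b8ed8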