On Sun, 20 Oct 2013, Ugis wrote:
> >> output follows:
> >> #pvs -o pe_start /dev/rbd1p1
> >>   1st PE
> >>   4.00m
> >> # cat /sys/block/rbd1/queue/minimum_io_size
> >> 4194304
> >> # cat /sys/block/rbd1/queue/optimal_io_size
> >> 4194304
> >
> > Well, the parameters are being set at least.  Mike, is it possible that
> > having minimum_io_size set to 4m is causing some read amplification
> > in LVM, translating a small read into a complete fetch of the PE (or
> > something along those lines)?
> >
> > Ugis, if your cluster is on the small side, it might be interesting to see
> > what requests the client is generating in the LVM and non-LVM case by
> > setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
> > '--debug-ms 1') and then looking at the osd_op messages that appear in
> > /var/log/ceph/ceph-osd*.log.  It may be obvious that the IO pattern is
> > different.
> >
> Sage, here follows debug output. I am no pro in reading this, but it
> seems the read block size differs (or what is the number following the
> ~ sign)?

Yep, it's offset~length.

It looks like without LVM we're getting 128KB requests (which IIRC is
typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit
fuzzy here, but I seem to recall a property on the request_queue or device
that affected this.  RBD is currently doing

	segment_size = rbd_obj_bytes(&rbd_dev->header);
	blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
	blk_queue_max_segment_size(q, segment_size);
	blk_queue_io_min(q, segment_size);
	blk_queue_io_opt(q, segment_size);

where segment_size is 4MB (so, much more than 128KB); maybe it has
something to do with how many smaller IOs get coalesced into larger
requests?

In any case, something appears to be lost due to the pass through LVM, but
I'm not very familiar with the block layer code at all...
:/

sage

>
> OSD.2 read with LVM:
> 2013-10-20 16:59:05.307159 7f95acfa5700 1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566434
> rbd_data.3ad974b0dc51.0000000000007cef [read 4083712~4096] ondisk = 0)
> v4 -- ?+0 0xdc35c00 con 0xd9e4840
> 2013-10-20 16:59:05.307655 7f95b27b0700 1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5548 ====
> osd_op(client.38069.1:176566435 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4087808~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (1554835253 0 0)
> 0x12593d80 con 0xd9e4840
> 2013-10-20 16:59:05.307824 7f95ac7a4700 1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566435
> rbd_data.3ad974b0dc51.0000000000007cef [read 4087808~4096] ondisk = 0)
> v4 -- ?+0 0xe24fc00 con 0xd9e4840
> 2013-10-20 16:59:05.308316 7f95b27b0700 1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5549 ====
> osd_op(client.38069.1:176566436 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4091904~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3467296840 0 0)
> 0xe28f6c0 con 0xd9e4840
> 2013-10-20 16:59:05.308499 7f95acfa5700 1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566436
> rbd_data.3ad974b0dc51.0000000000007cef [read 4091904~4096] ondisk = 0)
> v4 -- ?+0 0xdc35a00 con 0xd9e4840
> 2013-10-20 16:59:05.308985 7f95b27b0700 1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5550 ====
> osd_op(client.38069.1:176566437 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4096000~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3104591620 0 0)
> 0xe0b46c0 con 0xd9e4840
>
> OSD.2 read without LVM:
> 2013-10-20 17:03:13.730881 7f95ac7a4700 1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708854
> rb.0.967b.238e1f29.000000000071 [read 2359296~131072] ondisk = 0) v4
> -- ?+0 0x1019d200 con 0xd9e4840
> 2013-10-20 17:03:13.731318 7f95b27b0700 1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18232 ====
> osd_op(client.38069.1:176708855 rb.0.967b.238e1f29.000000000071 [read
> 2490368~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (1987168552 0 0)
> 0x171a7480 con 0xd9e4840
> 2013-10-20 17:03:13.731664 7f95acfa5700 1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708855
> rb.0.967b.238e1f29.000000000071 [read 2490368~131072] ondisk = 0) v4
> -- ?+0 0x12b81200 con 0xd9e4840
> 2013-10-20 17:03:13.733112 7f95b27b0700 1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18233 ====
> osd_op(client.38069.1:176708856 rb.0.967b.238e1f29.000000000071 [read
> 2621440~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (527551382 0 0)
> 0x12593d80 con 0xd9e4840
> 2013-10-20 17:03:13.733393 7f95ac7a4700 1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708856
> rb.0.967b.238e1f29.000000000071 [read 2621440~131072] ondisk = 0) v4
> -- ?+0 0xeba9000 con 0xd9e4840
> 2013-10-20 17:03:13.733741 7f95b27b0700 1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18234 ====
> osd_op(client.38069.1:176708857 rb.0.967b.238e1f29.000000000071 [read
> 2752512~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (178955972 0 0)
> 0xe0b4d80 con 0xd9e4840
>
> How should I proceed with tuning read performance on LVM? Is some
> change needed in the ceph/LVM code, or does my config need tuning?
> If what the logs show means a 4k read block in the LVM case, then it
> seems I need to tell LVM (or does xfs on top of LVM dictate the read
> block size?) that the io block should rather be 4m?
>
> Ugis
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
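
To compare the two cases over a whole log rather than a handful of lines, the request lengths (the number after the ~) can be tallied. The following is only a rough sketch: read_sizes is a hypothetical helper name, it assumes a grep that supports -o (GNU and BSD grep do), and it assumes each message sits on a single line as in the raw OSD log (the excerpts quoted above were wrapped by the mail client). It counts only incoming osd_op requests, not the osd_op_reply lines.

```shell
# read_sizes: hypothetical helper, not part of ceph.
# Reads ceph-osd log text on stdin and prints "<length> <count>" for every
# distinct read length seen in osd_op requests, smallest length first.
read_sizes() {
    # 'osd_op(' matches requests only; replies are logged as 'osd_op_reply('.
    grep 'osd_op(' |
        # Keep just the "[read offset~length" fragment of each request.
        grep -o '\[read [0-9]*~[0-9]*' |
        # Split on '~' so the second field is the length, then count it.
        awk -F'~' '{ n[$2]++ } END { for (len in n) print len, n[len] }' |
        sort -n
}
```

Something like `read_sizes < /var/log/ceph/ceph-osd.2.log` (the exact file name is an assumption; use whichever of the ceph-osd*.log files belongs to the OSD being tested) should then show mostly 4096 in the LVM case and mostly 131072 without LVM, if the pattern in the excerpts holds for the rest of the run.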