On Fri, Nov 28, 2014 at 5:46 PM, Dan Van Der Ster <daniel.vanderster@xxxxxxx> wrote:
> Hi Andrei,
> Yes, I’m testing from within the guest.
>
> Here is an example. First, I do 2MB reads when max_sectors_kb=512, and we
> see each read split into 4 (fio sees 25 iops, though iostat reports 100
> smaller iops):
>
> # echo 512 > /sys/block/vdb/queue/max_sectors_kb   # this is the default
> # fio --readonly --name /dev/vdb --rw=read --size=1G --ioengine=libaio --direct=1 --runtime=10s --blocksize=2m
> /dev/vdb: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=libaio, iodepth=1
> fio-2.0.13
> Starting 1 process
> Jobs: 1 (f=1): [R] [100.0% done] [51200K/0K/0K /s] [25 /0 /0 iops] [eta 00m:00s]
>
> Meanwhile iostat reports 100 iops with an average size of 1024 sectors
> (i.e. 512kB):
>
> Device:  rrqm/s  wrqm/s     r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await  svctm  %util
> vdb        0.00    0.00  100.00   0.00   50.00    0.00   1024.00      3.02   30.25  10.00 100.00
>
> Now increase max_sectors_kb to 4MB, and the IOs are no longer split:
>
> # echo 4096 > /sys/block/vdb/queue/max_sectors_kb
> # fio --readonly --name /dev/vdb --rw=read --size=1G --ioengine=libaio --direct=1 --runtime=10s --blocksize=2m
> /dev/vdb: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=libaio, iodepth=1
> fio-2.0.13
> Starting 1 process
> Jobs: 1 (f=1): [R] [100.0% done] [200.0M/0K/0K /s] [100 /0 /0 iops] [eta 00m:00s]
>
> iostat reports 100 iops of 4096 sectors each (i.e. 2MB):
>
> Device:  rrqm/s  wrqm/s     r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await  svctm  %util
> vdb      300.00    0.00  100.00   0.00  200.00    0.00   4096.00      0.99    9.94   9.94  99.40

We set the hard request size limit to the rbd object size (4M typically):

    blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);

but the block layer then caps the soft limit for fs requests at 512K
(BLK_DEF_MAX_SECTORS = 1024):

    limits->max_sectors = min_t(unsigned int, max_hw_sectors,
                                BLK_DEF_MAX_SECTORS);

which you are supposed to change on a per-device basis via sysfs.  We could
probably raise the soft limit to the rbd object size by default as well -
I don't see any harm in that.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
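
For reference, a minimal sketch of the kind of in-driver change Ilya
describes above, not the actual rbd patch. It assumes the rbd_init_disk()
style of queue setup in drivers/block/rbd.c, where segment_size is the
object size in bytes and SECTOR_SIZE is 512; the helper name
rbd_raise_soft_limit() is hypothetical:

    #include <linux/blkdev.h>

    static void rbd_raise_soft_limit(struct request_queue *q, u64 segment_size)
    {
            /* hard per-request limit: one full object (e.g. 4M) */
            blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);

            /*
             * blk_queue_max_hw_sectors() clamps limits->max_sectors (the
             * soft, fs-visible limit) to BLK_DEF_MAX_SECTORS (512K).
             * Raising it back to the hard limit lets object-sized direct
             * I/O through without the per-device sysfs tweak.
             */
            q->limits.max_sectors = q->limits.max_hw_sectors;
    }

Until something along these lines is the default, the per-device sysfs knob
shown in the quoted test (echo 4096 > /sys/block/vdb/queue/max_sectors_kb)
is the way to raise the soft limit at runtime.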