On Sat, Jul 15, 2017 at 8:00 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>
> On 14/07/17 18:43, Ruben Rodriguez wrote:
>> How to reproduce...
>
> I'll provide more concise details on how to test this behavior:
>
> Ceph config:
>
> [client]
> rbd readahead max bytes = 0   # we don't want forced readahead to fool us
> rbd cache = true
>
> Start a qemu vm with an rbd image attached via virtio-scsi:
>
>   <disk type='network' device='disk'>
>     <driver name='qemu' type='raw' cache='writeback'/>
>     <auth username='libvirt'>
>       <secret type='ceph' uuid='...'/>
>     </auth>
>     <source protocol='rbd' name='libvirt-pool/test'>
>       <host name='cephmon1' port='6789'/>
>       <host name='cephmon2' port='6789'/>
>       <host name='cephmon3' port='6789'/>
>     </source>
>     <blockio logical_block_size='512' physical_block_size='512'/>
>     <target dev='sdb' bus='scsi'/>
>     <boot order='2'/>
>     <address type='drive' controller='0' bus='0' target='0' unit='1'/>
>   </disk>
>
> Block device parameters, inside the vm:
>
> NAME ALIGN  MIN-IO  OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
> sdb      0 4194304 4194304     512     512    1 noop      128 4096    2G
>
> Collect performance statistics from librbd using this command:
>
> $ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump
>
> Note the values for:
> - rd: number of read operations done by qemu
> - rd_bytes: length of read requests done by qemu
> - cache_ops_hit: read operations hitting the cache
> - cache_ops_miss: read operations missing the cache
> - data_read: data read from the cache
> - op_r: number of objects sent by the OSD
>
> Perform one small read, not at the beginning of the image (because udev
> may have read it already), at a 4MB boundary:
>
> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
>
> Run the perf dump command again. Then do the read again, advanced by
> 5000 bytes so it does not overlap with the previous one:
>
> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
>
> Run the perf dump command once more. If you compare the op_r values at
> each step, you should see a cache miss and an object read each time:
> the same object is fetched twice.
>
> IMPACT:
>
> Let's take a look at how the op_r value increases when doing some
> common operations:
>
> - Booting a vm: this operation needs (in my case) ~70MB to be read,
>   which includes the kernel, the initrd and all files read by systemd
>   and the daemons, until a command prompt appears. Values read:
>     "rd": 2524,
>     "rd_bytes": 69685248,
>     "cache_ops_hit": 228,
>     "cache_ops_miss": 2268,
>     "cache_bytes_hit": 90353664,
>     "cache_bytes_miss": 63902720,
>     "data_read": 69186560,
>     "op": 2295,
>     "op_r": 2279,
>   That is 2,279 objects being fetched from the OSD to read 69MB.
>
> - Grepping inside the Linux source code (833MB) takes almost 3 minutes.
>   The values increase to:
>     "rd": 65127,
>     "rd_bytes": 1081487360,
>     "cache_ops_hit": 228,
>     "cache_ops_miss": 64885,
>     "cache_bytes_hit": 90353664,
>     "cache_bytes_miss": 1075672064,
>     "data_read": 1080988672,
>     "op_r": 64896,
>   That is over 60,000 objects fetched to read <1GB, and *0* new cache
>   hits. Optimized, this should take ~10 seconds and fetch ~700 objects.
>
> Is my Qemu implementation completely broken? Or is this expected?
> Please help!

I recommend watching the IO patterns via blktrace. The "60,000 objects
fetched" is a semi-misnomer -- it's saying that ~65,000 individual IO
operations were sent to the OSD. This doesn't imply that the operations
are against unique objects (i.e. there might be a lot of ops hitting the
same object). Your average IO size is at least 16K, so there must be
some level of OS request merging / readahead going on, since otherwise I
would expect ~2 million 512-byte IO requests.
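For the blktrace part, a minimal sketch, assuming the test image shows
up as /dev/sdb inside the vm as in the lsblk output above (run it in the
guest while repeating the dd reads):

    # trace the guest block device live and decode the events on stdout
    blktrace -d /dev/sdb -o - | blkparse -i -

The offsets and request sizes in the blkparse output should show whether
the guest is really issuing 512-byte reads or whether they are being
merged/expanded before they ever reach librbd.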
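To see where the ~16K average request size could be coming from, it may
also be worth checking the guest-side readahead and merge settings;
again just a sketch, assuming /dev/sdb:

    blockdev --getra /dev/sdb                # readahead window, in 512-byte sectors
    cat /sys/block/sdb/queue/read_ahead_kb   # same setting, in KB
    cat /sys/block/sdb/queue/nomerges        # 0 means the scheduler may merge requests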
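On the librbd side, rather than eyeballing the whole perf dump at every
step, something along these lines pulls out just the counters you listed
(the socket path is the one from your perf dump command; using jq for
pretty-printing is an assumption on my part):

    asok=/var/run/ceph/ceph-client.[...].asok
    ceph --admin-daemon "$asok" perf dump | jq . \
        | egrep '"(rd|rd_bytes|cache_ops_hit|cache_ops_miss|cache_bytes_hit|cache_bytes_miss|data_read|op|op_r)":'

Running that before and after each dd makes the per-step deltas in
cache_ops_miss and op_r easy to spot.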
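It might also be worth confirming that the cache settings from ceph.conf
actually took effect in the running client; the admin socket can report
the live values, e.g.:

    ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok config show \
        | egrep 'rbd_cache|rbd_readahead'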
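As a side note, the block device parameter table above looks like lsblk
topology output; if you want to re-check it inside the vm after changing
anything, something like this should reproduce it:

    lsblk -t /dev/sdb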
>
> --
> Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
> GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
> https://fsf.org | https://gnu.org

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com