On 14/07/17 18:43, Ruben Rodriguez wrote:
> How to reproduce...

I'll provide more concise details on how to test this behavior:

Ceph config:

[client]
rbd readahead max bytes = 0   # we don't want forced readahead to fool us
rbd cache = true

Start a qemu VM with an RBD image attached via virtio-scsi:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='...'/>
      </auth>
      <source protocol='rbd' name='libvirt-pool/test'>
        <host name='cephmon1' port='6789'/>
        <host name='cephmon2' port='6789'/>
        <host name='cephmon3' port='6789'/>
      </source>
      <blockio logical_block_size='512' physical_block_size='512'/>
      <target dev='sdb' bus='scsi'/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>

Block device parameters, inside the VM:

NAME ALIGN  MIN-IO  OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
sdb      0 4194304 4194304     512     512    1 noop      128 4096    2G

Collect performance statistics from librbd, using this command:

$ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump

Note the values for:
- rd: number of read operations done by qemu
- rd_bytes: total length of the read requests done by qemu
- cache_ops_hit: read operations hitting the cache
- cache_ops_miss: read operations missing the cache
- data_read: data read from the cache
- op_r: number of objects sent by the OSD

Then (a script sketch chaining these steps is appended at the end of this
mail):

1. Perform one small read, not at the beginning of the image (because udev
   may have read it already), at a 4MB boundary:

   dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes

2. Run the perf dump command.

3. Do it again, advanced by 5000 bytes so it does not overlap with the
   previous read:

   dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes

4. Run the perf dump command again.

If you compare the counters at each step, you should see a cache miss and an
object read each time: the same object is fetched from the OSD twice.

IMPACT:

Let's take a look at how the op_r value increases with some common
operations:

- Booting a VM: this needs (in my case) ~70MB to be read, which includes the
  kernel, the initrd, and all files read by systemd and the daemons, until a
  command prompt appears. Values read:

    "rd": 2524,
    "rd_bytes": 69685248,
    "cache_ops_hit": 228,
    "cache_ops_miss": 2268,
    "cache_bytes_hit": 90353664,
    "cache_bytes_miss": 63902720,
    "data_read": 69186560,
    "op": 2295,
    "op_r": 2279,

  That is 2,279 objects fetched from the OSDs to read 69MB.

- Grepping through the Linux kernel source tree (833MB) takes almost 3
  minutes. The values increase to:

    "rd": 65127,
    "rd_bytes": 1081487360,
    "cache_ops_hit": 228,
    "cache_ops_miss": 64885,
    "cache_bytes_hit": 90353664,
    "cache_bytes_miss": 1075672064,
    "data_read": 1080988672,
    "op_r": 64896,

  That is over 60,000 objects fetched to read <1GB, and *0* cache hits.
  Done optimally, this should take ~10 seconds and fetch ~700 objects.

Is my qemu implementation completely broken? Or is this expected? Please
help!

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org
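
P.S. For convenience, here is a rough shell sketch that chains the
reproduction steps above. It is only a sketch of the procedure: the admin
socket glob, the ssh destination of the guest, and the guest device name
(/dev/sdb) are assumptions from my setup and will need adjusting for yours.

#!/bin/bash
# Run this on the hypervisor: the dd reads are executed inside the guest
# over ssh, the perf counters are read from the local librbd admin socket.

GUEST=root@testvm                                       # ssh target of the VM (assumption)
DEV=/dev/sdb                                            # RBD-backed device inside the guest (assumption)
ASOK=$(ls /var/run/ceph/ceph-client.*.asok | head -n1)  # librbd admin socket (adjust if you have several)

counters() {
    # Print only the counters discussed above from the full perf dump JSON
    ceph --admin-daemon "$ASOK" perf dump | \
        grep -E '"(rd|rd_bytes|cache_ops_hit|cache_ops_miss|data_read|op_r)"'
}

echo "== baseline =="; counters

# First 512-byte read at a 4MB boundary, 40MiB into the image
ssh "$GUEST" "dd if=$DEV ibs=512 count=1 skip=41943040 iflag=skip_bytes of=/dev/null"
echo "== after first read =="; counters

# Second 512-byte read, 5000 bytes further, so it does not overlap the first
ssh "$GUEST" "dd if=$DEV ibs=512 count=1 skip=41948040 iflag=skip_bytes of=/dev/null"
echo "== after second read =="; counters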