On Sat, Jul 15, 2017 at 8:00 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>
> On 14/07/17 18:43, Ruben Rodriguez wrote:
>> How to reproduce...
>
> I'll provide more concise details on how to test this behavior:
>
> Ceph config:
>
> [client]
> rbd readahead max bytes = 0   # we don't want forced readahead to fool us
> rbd cache = true
>
> Start a qemu vm with an rbd image attached via virtio-scsi:
>
>   <disk type='network' device='disk'>
>     <driver name='qemu' type='raw' cache='writeback'/>
>     <auth username='libvirt'>
>       <secret type='ceph' uuid='...'/>
>     </auth>
>     <source protocol='rbd' name='libvirt-pool/test'>
>       <host name='cephmon1' port='6789'/>
>       <host name='cephmon2' port='6789'/>
>       <host name='cephmon3' port='6789'/>
>     </source>
>     <blockio logical_block_size='512' physical_block_size='512'/>
>     <target dev='sdb' bus='scsi'/>
>     <boot order='2'/>
>     <address type='drive' controller='0' bus='0' target='0' unit='1'/>
>   </disk>
>
> Block device parameters, inside the vm:
>
> NAME ALIGN  MIN-IO  OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
> sdb      0 4194304 4194304     512     512    1 noop      128 4096    2G
>
> Collect performance statistics from librbd using this command:
>
> $ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump
>
> Note the values for:
> - rd: number of read operations done by qemu
> - rd_bytes: length of read requests done by qemu
> - cache_ops_hit: read operations hitting the cache
> - cache_ops_miss: read operations missing the cache
> - data_read: data read from the cache
> - op_r: number of objects sent by the OSD
>
> Perform one small read, not at the beginning of the image (because udev
> may have read it already), at a 4MB boundary:
>
> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
>
> Run the perf dump command again. Then do the read again, advanced by
> 5000 bytes so it does not overlap with the previous one:
>
> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
>
> Run the perf dump command once more. If you compare the op_r values at
> each step, you should see a cache miss and an object read each time:
> the same object is fetched twice.
>
> IMPACT:
>
> Let's take a look at how the op_r value increases when doing some
> common operations:
>
> - Booting a vm: this operation needs (in my case) ~70MB to be read,
>   which includes the kernel, the initrd and all files read by systemd
>   and the daemons, until a command prompt appears. Values read:
>     "rd": 2524,
>     "rd_bytes": 69685248,
>     "cache_ops_hit": 228,
>     "cache_ops_miss": 2268,
>     "cache_bytes_hit": 90353664,
>     "cache_bytes_miss": 63902720,
>     "data_read": 69186560,
>     "op": 2295,
>     "op_r": 2279,
>   That is 2,279 objects being fetched from the OSD to read 69MB.
>
> - Grepping inside the Linux source code (833MB) takes almost 3 minutes.
>   The values increase to:
>     "rd": 65127,
>     "rd_bytes": 1081487360,
>     "cache_ops_hit": 228,
>     "cache_ops_miss": 64885,
>     "cache_bytes_hit": 90353664,
>     "cache_bytes_miss": 1075672064,
>     "data_read": 1080988672,
>     "op_r": 64896,
>   That is over 60,000 objects fetched to read <1GB, and *0* new cache
>   hits. Optimized, this should take ~10 seconds and fetch ~700 objects.
>
> Is my Qemu implementation completely broken? Or is this expected?
> Please help!

I recommend watching the IO patterns via blktrace. The "60,000 objects
fetched" is a semi-misnomer -- it's saying that ~65,000 individual IO
operations were sent to the OSD. This doesn't imply that the operations
are against unique objects (i.e. there might be a lot of ops hitting the
same object). Your average IO size is at least 16K, so there must be
some level of OS request merging / readahead going on, since otherwise I
would expect ~2 million 512-byte IO requests.
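For the blktrace part, a minimal sketch, assuming the test image shows
up as /dev/sdb inside the vm as in the lsblk output above (run it in the
guest while repeating the dd reads):

    # trace the guest block device live and decode the events on stdout
    blktrace -d /dev/sdb -o - | blkparse -i -

The offsets and request sizes in the blkparse output should show whether
the guest is really issuing 512-byte reads or whether they are being
merged/expanded before they ever reach librbd.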
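To see where the ~16K average request size could be coming from, it may
also be worth checking the guest-side readahead and merge settings;
again just a sketch, assuming /dev/sdb:

    blockdev --getra /dev/sdb                # readahead window, in 512-byte sectors
    cat /sys/block/sdb/queue/read_ahead_kb   # same setting, in KB
    cat /sys/block/sdb/queue/nomerges        # 0 means the scheduler may merge requests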
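On the librbd side, rather than eyeballing the whole perf dump at every
step, something along these lines pulls out just the counters you listed
(the socket path is the one from your perf dump command; using jq for
pretty-printing is an assumption on my part):

    asok=/var/run/ceph/ceph-client.[...].asok
    ceph --admin-daemon "$asok" perf dump | jq . \
        | egrep '"(rd|rd_bytes|cache_ops_hit|cache_ops_miss|cache_bytes_hit|cache_bytes_miss|data_read|op|op_r)":'

Running that before and after each dd makes the per-step deltas in
cache_ops_miss and op_r easy to spot.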
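It might also be worth confirming that the cache settings from ceph.conf
actually took effect in the running client; the admin socket can report
the live values, e.g.:

    ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok config show \
        | egrep 'rbd_cache|rbd_readahead'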
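As a side note, the block device parameter table above looks like lsblk
topology output; if you want to re-check it inside the vm after changing
anything, something like this should reproduce it:

    lsblk -t /dev/sdb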
>
> --
> Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
> GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
> https://fsf.org | https://gnu.org

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com