On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>
> I'm having an issue with small sequential reads (such as searching
> through source code files, etc), and I found that multiple small reads
> within a 4MB boundary would fetch the same object from the OSD multiple
> times, as it gets inserted into the RBD cache partially.
>
> How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
> writethrough cache on. Monitor with perf dump on the rbd client. The
> image is filled up with zeroes in advance. Rbd readahead is off.
>
> 1 - Small read from a previously unread section of the disk:
> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
> Notes: dd cannot read less than 512 bytes. The skip is arbitrary to
> avoid the beginning of the disk, which would have been read at boot.
>
> Expected outcomes: perf dump should show a +1 increase on values rd,
> cache_ops_miss and op_r. This happens correctly.
> It should show a 4194304 increase in data_read as a whole object is put
> into the cache. Instead it increases by 4096. (not sure why 4096, btw).
>
> 2 - Small read from less than 4MB distance (in the example, +5000b).
> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
> Expected outcomes: perf dump should show a +1 increase on cache_ops_hit.
> Instead cache_ops_miss increases.
> It should show a 4194304 increase in data_read as a whole object is put
> into the cache. Instead it increases by 4096.
> op_r should not increase. Instead it increases by one, indicating that
> the object was fetched again.
>
> My tests show that this could be causing a 6 to 20-fold performance loss
> in small sequential reads.
>
> Is it by design that the RBD cache only inserts the portion requested by
> the client instead of the whole last object fetched? Could it be a
> tunable in any of my layers (fs, block device, qemu, rbd...) that is
> preventing this?

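For reference, here is a rough sketch of where the two dd offsets quoted
above land. It assumes 4 MiB objects (the 4194304 figure in your mail)
and 4 KiB-aligned reads (my guess from the +4096 you saw in data_read).
The names OBJECT_SIZE, BLOCK_SIZE and locate are made up for this
illustration; none of this is librbd code.

```python
# Assumptions, not values read from any Ceph configuration:
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB object size, per the 4194304 in the report
BLOCK_SIZE = 4096              # guessed from the observed +4096 in data_read


def locate(offset, length=512):
    """Map a byte range on the image to (object index, offset inside the
    object, first 4 KiB block, last 4 KiB block) under the assumptions above."""
    obj = offset // OBJECT_SIZE
    in_obj = offset % OBJECT_SIZE
    first_block = in_obj // BLOCK_SIZE
    last_block = (in_obj + length - 1) // BLOCK_SIZE
    return obj, in_obj, first_block, last_block


first = locate(41943040)   # skip value from step 1
second = locate(41948040)  # skip value from step 2 (+5000 bytes)

print("step 1:", first)
print("step 2:", second)
print("same object:", first[0] == second[0])
print("same 4 KiB block:", first[2] == second[2])
```

Both reads fall in the same object, but in different 4 KiB blocks
(block 0 and block 1). If only the requested 4 KiB extent ends up in the
cache, a miss on the second read is what you would expect to see.
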
I don't know the exact readahead default values in that stack, but
there's no general reason to expect RBD (or any Ceph component) to read
a whole object at a time. In this case, you're asking for 512 bytes and
it appears to have turned that into a 4KB read (probably the virtual
block size in use?), which seems pretty reasonable. If you were asking
for 512 bytes out of every 4MB and it was reading 4MB each time, you'd
probably be wondering why you were only getting 1/8192 the expected
bandwidth. ;)
-Greg

>
> Regards,
> --
> Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
> GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
> https://fsf.org | https://gnu.org
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>