Are you 100% positive that your files are actually stored sequentially
on the block device? I would recommend running blktrace to verify the
IO pattern from your use-case.

On Sat, Jul 15, 2017 at 5:42 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>
>
> On 15/07/17 09:43, Nick Fisk wrote:
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>>> Gregory Farnum
>>> Sent: 15 July 2017 00:09
>>> To: Ruben Rodriguez <ruben@xxxxxxx>
>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>> Subject: Re: RBD cache being filled up in small increases instead of 4MB
>>>
>>> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>>>>
>>>> I'm having an issue with small sequential reads (such as searching
>>>> through source code files, etc.): multiple small reads within a 4MB
>>>> boundary fetch the same object from the OSD multiple times, because
>>>> the object only gets partially inserted into the RBD cache.
>>>>
>>>> How to reproduce: an RBD image accessed from a QEMU VM using
>>>> virtio-scsi, with writethrough cache on. Monitor with perf dump on
>>>> the RBD client. The image is filled with zeroes in advance. RBD
>>>> readahead is off.
>>>>
>>>> 1 - Small read from a previously unread section of the disk:
>>>> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
>>>> Notes: dd cannot read less than 512 bytes. The skip is arbitrary, to
>>>> avoid the beginning of the disk, which would have been read at boot.
>>>>
>>>> Expected outcomes: perf dump should show a +1 increase in rd,
>>>> cache_ops_miss and op_r. This happens correctly.
>>>> It should show a 4194304 increase in data_read, as a whole object is
>>>> put into the cache. Instead it increases by 4096 (not sure why 4096,
>>>> btw).
>>>>
>>>> 2 - Small read from less than 4MB away (in the example, +5000 bytes):
>>>> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
>>>> Expected outcomes: perf dump should show a +1 increase in
>>>> cache_ops_hit. Instead cache_ops_miss increases.
>>>> It should show a 4194304 increase in data_read, as a whole object is
>>>> put into the cache. Instead it increases by 4096.
>>>> op_r should not increase. Instead it increases by one, indicating
>>>> that the object was fetched again.
>>>>
>>>> My tests show that this could be causing a 6- to 20-fold performance
>>>> loss in small sequential reads.
>>>>
>>>> Is it by design that the RBD cache only inserts the portion requested
>>>> by the client instead of the whole last object fetched? Could it be a
>>>> tunable in any of my layers (fs, block device, qemu, rbd...) that is
>>>> preventing this?
>>>
>>> I don't know the exact readahead default values in that stack, but
>>> there's no general reason to think RBD (or any Ceph component) will
>>> read a whole object at a time. In this case, you're asking for 512
>>> bytes and it appears to have turned that into a 4KB read (probably the
>>> virtual block size in use?), which seems pretty reasonable — if you
>>> were asking for 512 bytes out of every 4MB and it was reading 4MB each
>>> time, you'd probably be wondering why you were only getting 1/8192 of
>>> the expected bandwidth. ;) -Greg
>>
>> I think the general readahead logic might be a bit more advanced in the
>> Linux kernel than the readahead in the librbd client.
>
> Yes, the problems I'm having should be corrected by the VM kernel
> issuing larger reads, but I'm failing to get that to happen.
>
>> The kernel will watch how successful each readahead is and scale as
>> necessary. You might want to try upping read_ahead_kb for the block
>> device in the VM. Something between 4MB and 32MB works well for RBDs,
>> but make sure you have a 4.x kernel, as some fixes to the readahead max
>> size were introduced and I'm not sure whether they ever got backported.
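For reference, the read_ahead_kb setting Nick mentions is just a sysfs
knob inside the guest. A minimal sketch, assuming the virtio-scsi disk
shows up as /dev/sdb in the VM and that 4MB is the target (values are
illustrative, not a recommendation):

  # check the current per-device readahead window (in KB)
  cat /sys/block/sdb/queue/read_ahead_kb
  # raise it to 4MB; this does not persist across reboots, so it would
  # normally go into a udev rule or an init script
  echo 4096 > /sys/block/sdb/queue/read_ahead_kb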
>
> I'm using kernels 4.4 and 4.8. I have readahead, min_io_size,
> optimum_io_size and max_sectors_kb set to 4MB. It helps in some use
> cases, like fio or dd tests, but not with real-world tests like cp,
> grep or tar on a large pool of small files.
>
> From all I can tell, optimal read performance would happen if the VM
> kernel read in 4MB increments _every_ _time_. I can force that with an
> ugly hack (putting the files inside a big formatted file, mounted as a
> loop device), and it gives a 20-fold performance gain. But that is just
> silly...
>
> I documented that experiment in this thread:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018924.html
>
>> Unless you tell the rbd client not to disable readahead after reading
>> the first x number of bytes (rbd readahead disable after bytes=0), it
>> will stop reading ahead and will only cache exactly what is requested
>> by the client.
>
> I realized that, so as a proof of concept I made some changes to the
> readahead mechanism: I force it on, make it trigger every time, and made
> the max and min readahead size 4MB. This way I ensure whole objects get
> into the cache, and I get a 6-fold performance gain reading small files.
>
> This is just a proof of concept; I don't advocate for this behavior to
> be implemented by the readahead function. Ideally it should be up to the
> client to issue the correct read sizes. But what if the client is
> faulty? I think it could be useful to have the option to tell librbd to
> cache whole objects.
>
> --
> Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
> GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
> https://fsf.org | https://gnu.org

--
Jason
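For anyone who wants to experiment with the librbd side of this, the
readahead knobs referenced above live in the [client] section of
ceph.conf. This is only a rough sketch: the values are illustrative
assumptions in the spirit of what Nick and Ruben describe, not
recommended defaults, and Ruben's proof of concept also involved code
changes (such as a minimum readahead size) that configuration alone
cannot reproduce:

  [client]
      rbd cache = true
      # keep readahead enabled instead of switching it off after the
      # default amount of data has been read (0 = never disable)
      rbd readahead disable after bytes = 0
      # trigger readahead after a single sequential request and allow
      # it to read up to a whole 4MB object's worth at a time
      rbd readahead trigger requests = 1
      rbd readahead max bytes = 4194304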