Re: RBD cache being filled up in small increases instead of 4MB

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Gregory Farnum
> Sent: 15 July 2017 00:09
> To: Ruben Rodriguez <ruben@xxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  RBD cache being filled up in small increases instead
> of 4MB
> 
> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
> >
> > I'm having an issue with small sequential reads (such as searching
> > through source code files, etc), and I found that multiple small reads
> > within a 4MB boundary would fetch the same object from the OSD
> > multiple times, as it gets inserted into the RBD cache partially.
> >
> > How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
> > writethrough cache on. Monitor with perf dump on the rbd client. The
> > image is filled up with zeroes in advance. Rbd readahead is off.
> >
> > 1 - Small read from a previously unread section of the disk:
> > dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
> > Notes: dd cannot read less than 512 bytes. The skip is arbitrary to
> > avoid the beginning of the disk, which would have been read at boot.
> >
> > Expected outcomes: perf dump should show a +1 increase on values rd,
> > cache_ops_miss and op_r. This happens correctly.
> > It should show a 4194304 increase in data_read as a whole object is
> > put into the cache. Instead it increases by 4096. (not sure why 4096, btw).
> >
> > 2 - Small read from less than 4MB distance (in the example, +5000b).
> > dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
> > Expected outcomes: perf dump should show a +1 increase on cache_ops_hit.
> > Instead cache_ops_miss increases.
> > It should show a 4194304 increase in data_read as a whole object is
> > put into the cache. Instead it increases by 4096.
> > op_r should not increase. Instead it increases by one, indicating that
> > the object was fetched again.
> >
> > My tests show that this could be causing a 6 to 20-fold performance
> > loss in small sequential reads.
> >
> > Is it by design that the RBD cache only inserts the portion requested
> > by the client instead of the whole last object fetched? Could it be a
> > tunable in any of my layers (fs, block device, qemu, rbd...) that is
> > preventing this?
> 
> I don't know the exact readahead default values in that stack, but there's no
> general reason to think RBD (or any Ceph component) will read a whole
> object at a time. In this case, you're asking for 512 bytes and it appears to
> have turned that into a 4KB read (probably the virtual block size in use?),
> which seems pretty reasonable — if you were asking for 512 bytes out of
> every 4MB and it was reading 4MB each time, you'd probably be wondering
> why you were only getting 1/8192 the expected bandwidth. ;) -Greg

I think the general readahead logic in the Linux kernel is a bit more advanced than the readahead in the librbd client: the kernel watches how successful each readahead is and scales it as necessary. You might want to try upping read_ahead_kb for the block device in the VM. Something between 4MB and 32MB works well for RBDs, but make sure you are on a 4.x kernel, as some fixes to the maximum readahead size went in there and I'm not sure whether they were ever backported.
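
For example (just a sketch; the device name /dev/sdb and the 8MB value are only illustrative), you can check and raise the guest's block-layer readahead via sysfs:

  cat /sys/block/sdb/queue/read_ahead_kb           # current readahead in KB
  echo 8192 > /sys/block/sdb/queue/read_ahead_kb   # raise it to 8MB for /dev/sdb

Note this doesn't persist across reboots, so you'd normally set it from a udev rule or an init/rc script.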

Unless you tell the rbd client not to disable readahead after reading the first N bytes (rbd readahead disable after bytes = 0), it will stop reading ahead and will only cache exactly what the client requests.
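
For reference, something like the following in the [client] section of ceph.conf on the hypervisor should keep librbd readahead active (the values here are only illustrative, e.g. a 4MB cap to match the object size):

  [client]
  rbd readahead disable after bytes = 0    # never switch readahead off
  rbd readahead max bytes = 4194304        # read ahead at most 4MB at a time
  rbd readahead trigger requests = 10      # sequential requests needed before readahead kicks in

Bear in mind librbd readahead is much simpler than the kernel's, so the guest-side read_ahead_kb tuning above is usually the better lever.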

> 
> >
> > Regards,
> > --
> > Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
> > GPG Key: 05EF 1D2F FE61 747D 1FC8  27C3 7FAC 7D26 472F 4409
> > https://fsf.org | https://gnu.org
> >
> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



