Re: RBD cache being filled up in small increases instead of 4MB

On 15/07/17 09:43, Nick Fisk wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Gregory Farnum
>> Sent: 15 July 2017 00:09
>> To: Ruben Rodriguez <ruben@xxxxxxx>
>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re:  RBD cache being filled up in small increases instead
>> of 4MB
>>
>> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>>>
>>> I'm having an issue with small sequential reads (such as searching
>>> through source code files, etc), and I found that multiple small reads
>>> within a 4MB boundary would fetch the same object from the OSD
>>> multiple times, as it gets inserted into the RBD cache partially.
>>>
>>> How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
>>> writethrough cache on. Monitor with perf dump on the rbd client. The
>>> image is filled up with zeroes in advance. Rbd readahead is off.
>>>
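(For clarity, "monitor with perf dump" here means reading the librbd
perf counters from the client admin socket; this assumes an admin
socket is configured for the qemu process in ceph.conf, and the path
below is only an example:

  ceph --admin-daemon /var/run/ceph/ceph-client.admin.<pid>.<cctid>.asok perf dump

The counters referenced below (rd, op_r, cache_ops_hit, cache_ops_miss,
data_read) are all in that output.)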
>>> 1 - Small read from a previously unread section of the disk:
>>> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
>>> Notes: dd cannot read less than 512 bytes. The skip is arbitrary to
>>> avoid the beginning of the disk, which would have been read at boot.
>>>
>>> Expected outcomes: perf dump should show a +1 increase on values rd,
>>> cache_ops_miss and op_r. This happens correctly.
>>> It should show a 4194304 increase in data_read as a whole object is
>>> put into the cache. Instead it increases by 4096. (not sure why 4096, btw).
>>>
>>> 2 - Small read from less than 4MB distance (in the example, +5000b).
>>> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
>>> Expected outcomes: perf dump should show a +1 increase on cache_ops_hit.
>>> Instead cache_ops_miss increases.
>>> It should show a 4194304 increase in data_read as a whole object is
>>> put into the cache. Instead it increases by 4096.
>>> op_r should not increase. Instead it increases by one, indicating that
>>> the object was fetched again.
>>>
>>> My tests show that this could be causing a 6 to 20-fold performance
>>> loss in small sequential reads.
>>>
>>> Is it by design that the RBD cache only inserts the portion requested
>>> by the client instead of the whole last object fetched? Could it be a
>>> tunable in any of my layers (fs, block device, qemu, rbd...) that is
>>> preventing this?
>>
>> I don't know the exact readahead default values in that stack, but there's no
>> general reason to think RBD (or any Ceph component) will read a whole
>> object at a time. In this case, you're asking for 512 bytes and it appears to
>> have turned that into a 4KB read (probably the virtual block size in use?),
>> which seems pretty reasonable — if you were asking for 512 bytes out of
>> every 4MB and it was reading 4MB each time, you'd probably be wondering
>> why you were only getting 1/8192 the expected bandwidth. ;) -Greg
> 
> I think the general readahead logic might be a bit more advanced in the Linux kernel vs using readahead from the librbd client.

Yes, the problems I'm having should be corrected by the vm kernel
issuing larger reads, but I'm failing to get that to happen.

> The kernel will watch how successful each readahead is and scale as necessary. You might want to try upping the read_ahead_kb for the block device in the VM. Something between 4MB and 32MB works well for RBDs, but make sure you have a 4.x kernel, as some fixes to the readahead max size were introduced and I'm not sure if they ever got backported.

I'm using kernel 4.4 and 4.8. I have readahead, min_io_size,
optimum_io_size and max_sectors_kb set to 4MB. It helps in some use
cases, like fio or dd tests, but not with real world tests like cp,
grep, tar on a large pool of small files.
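
For reference, this is roughly how I have those set; the device name
and values are just an example of the approach:

  # inside the VM, per block device (both values are in KB)
  echo 4096 > /sys/block/sdb/queue/read_ahead_kb
  echo 4096 > /sys/block/sdb/queue/max_sectors_kb

min_io_size and optimum_io_size are advertised by the virtual disk
itself, so those have to be set on the hypervisor side (qemu/libvirt
device properties) rather than from inside the guest.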

From all I can tell, optimal read performance would happen when the vm
kernel reads in 4MB increments _every_ _time_. I can force that with an
ugly hack (putting the files inside a formatted big file, mounted as
loop), which gives a 20-fold performance gain. But that is just silly...

I documented that experiment on this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018924.html
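
The hack looks roughly like this (paths, sizes and filesystem are just
for illustration, the real numbers are in that thread):

  # a big file on the RBD-backed filesystem inside the VM
  truncate -s 50G /srv/rbd/blob.img
  mkfs.ext4 -F /srv/rbd/blob.img
  mount -o loop /srv/rbd/blob.img /srv/loopfs
  # readahead on the loop device set to 4MB as well
  echo 4096 > /sys/block/loop0/queue/read_ahead_kb

With the small files copied into /srv/loopfs, reads go through the loop
device, which ends up issuing much larger requests to the underlying
RBD disk.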

> Unless you tell the rbd client to not disable readahead after reading the 1st x number of bytes (rbd readahead disable after bytes=0), it will stop reading ahead and will only cache exactly what is requested by the client.

I realized that, so as a proof of concept I made some changes to the
readahead mechanism: I forced it on, made it trigger every time, and
set both the minimum and maximum readahead size to 4MB. This way whole
objects always end up in the cache, and I get a 6-fold performance gain
reading small files.
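
(For anyone who wants to approximate that without patching librbd, the
closest I can get with the existing tunables is something like this in
the [client] section of ceph.conf:

  rbd readahead trigger requests = 1
  rbd readahead max bytes = 4194304
  rbd readahead disable after bytes = 0

As far as I can tell there is no knob for a minimum readahead size,
which is part of why I had to touch the code, and this only changes
when and how much readahead fetches, not what the cache keeps.)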

This is just a proof of concept; I don't advocate for this behavior to
be implemented by the readahead function. Ideally it should be up to the
client to issue the correct read sizes. But what if the client is
faulty? I think it could be useful to have the option to tell librbd to
cache whole objects.

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



