Are you 100% positive that your files are actually stored sequentially
on the block device? I would recommend running blktrace to verify the
IO pattern from your use-case.

On Sat, Jul 15, 2017 at 5:42 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>
>
> On 15/07/17 09:43, Nick Fisk wrote:
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>>> Gregory Farnum
>>> Sent: 15 July 2017 00:09
>>> To: Ruben Rodriguez <ruben@xxxxxxx>
>>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>> Subject: Re: RBD cache being filled up in small increases instead of 4MB
>>>
>>> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>>>>
>>>> I'm having an issue with small sequential reads (such as searching
>>>> through source code files, etc.): multiple small reads within a 4MB
>>>> boundary fetch the same object from the OSD multiple times, because
>>>> the object only gets partially inserted into the RBD cache.
>>>>
>>>> How to reproduce: an RBD image accessed from a QEMU VM using
>>>> virtio-scsi, with writethrough cache on. Monitor with perf dump on
>>>> the RBD client. The image is filled with zeroes in advance. RBD
>>>> readahead is off.
>>>>
>>>> 1 - Small read from a previously unread section of the disk:
>>>> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
>>>> Notes: dd cannot read less than 512 bytes. The skip is arbitrary, to
>>>> avoid the beginning of the disk, which would have been read at boot.
>>>>
>>>> Expected outcomes: perf dump should show a +1 increase in rd,
>>>> cache_ops_miss and op_r. This happens correctly.
>>>> It should show a 4194304 increase in data_read, as a whole object is
>>>> put into the cache. Instead it increases by 4096 (not sure why 4096,
>>>> btw).
>>>>
>>>> 2 - Small read from less than 4MB away (in the example, +5000 bytes):
>>>> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
>>>> Expected outcomes: perf dump should show a +1 increase in
>>>> cache_ops_hit. Instead cache_ops_miss increases.
>>>> It should show a 4194304 increase in data_read, as a whole object is
>>>> put into the cache. Instead it increases by 4096.
>>>> op_r should not increase. Instead it increases by one, indicating
>>>> that the object was fetched again.
>>>>
>>>> My tests show that this could be causing a 6- to 20-fold performance
>>>> loss in small sequential reads.
>>>>
>>>> Is it by design that the RBD cache only inserts the portion requested
>>>> by the client instead of the whole last object fetched? Could it be a
>>>> tunable in any of my layers (fs, block device, qemu, rbd...) that is
>>>> preventing this?
>>>
>>> I don't know the exact readahead default values in that stack, but
>>> there's no general reason to think RBD (or any Ceph component) will
>>> read a whole object at a time. In this case, you're asking for 512
>>> bytes and it appears to have turned that into a 4KB read (probably the
>>> virtual block size in use?), which seems pretty reasonable — if you
>>> were asking for 512 bytes out of every 4MB and it was reading 4MB each
>>> time, you'd probably be wondering why you were only getting 1/8192 of
>>> the expected bandwidth. ;) -Greg
>>
>> I think the general readahead logic might be a bit more advanced in the
>> Linux kernel than the readahead in the librbd client.
>
> Yes, the problems I'm having should be corrected by the VM kernel
> issuing larger reads, but I'm failing to get that to happen.
>
>> The kernel will watch how successful each readahead is and scale as
>> necessary. You might want to try upping read_ahead_kb for the block
>> device in the VM. Something between 4MB and 32MB works well for RBDs,
>> but make sure you have a 4.x kernel, as some fixes to the readahead max
>> size were introduced and I'm not sure whether they ever got backported.
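For reference, the read_ahead_kb setting Nick mentions is just a sysfs
knob inside the guest. A minimal sketch, assuming the virtio-scsi disk
shows up as /dev/sdb in the VM and that 4MB is the target (values are
illustrative, not a recommendation):

  # check the current per-device readahead window (in KB)
  cat /sys/block/sdb/queue/read_ahead_kb
  # raise it to 4MB; this does not persist across reboots, so it would
  # normally go into a udev rule or an init script
  echo 4096 > /sys/block/sdb/queue/read_ahead_kb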
>
> I'm using kernels 4.4 and 4.8. I have readahead, min_io_size,
> optimum_io_size and max_sectors_kb set to 4MB. It helps in some use
> cases, like fio or dd tests, but not with real-world tests like cp,
> grep or tar on a large pool of small files.
>
> From all I can tell, optimal read performance would happen if the VM
> kernel read in 4MB increments _every_ _time_. I can force that with an
> ugly hack (putting the files inside a big formatted file, mounted as a
> loop device), and it gives a 20-fold performance gain. But that is just
> silly...
>
> I documented that experiment in this thread:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018924.html
>
>> Unless you tell the rbd client not to disable readahead after reading
>> the first x number of bytes (rbd readahead disable after bytes=0), it
>> will stop reading ahead and will only cache exactly what is requested
>> by the client.
>
> I realized that, so as a proof of concept I made some changes to the
> readahead mechanism: I force it on, make it trigger every time, and made
> the max and min readahead size 4MB. This way I ensure whole objects get
> into the cache, and I get a 6-fold performance gain reading small files.
>
> This is just a proof of concept; I don't advocate for this behavior to
> be implemented by the readahead function. Ideally it should be up to the
> client to issue the correct read sizes. But what if the client is
> faulty? I think it could be useful to have the option to tell librbd to
> cache whole objects.
>
> --
> Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
> GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
> https://fsf.org | https://gnu.org

--
Jason
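For anyone who wants to experiment with the librbd side of this, the
readahead knobs referenced above live in the [client] section of
ceph.conf. This is only a rough sketch: the values are illustrative
assumptions in the spirit of what Nick and Ruben describe, not
recommended defaults, and Ruben's proof of concept also involved code
changes (such as a minimum readahead size) that configuration alone
cannot reproduce:

  [client]
      rbd cache = true
      # keep readahead enabled instead of switching it off after the
      # default amount of data has been read (0 = never disable)
      rbd readahead disable after bytes = 0
      # trigger readahead after a single sequential request and allow
      # it to read up to a whole 4MB object's worth at a time
      rbd readahead trigger requests = 1
      rbd readahead max bytes = 4194304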