On 13/12/2018 15:10, Mark Nelson wrote:
> Hi Florian,
>
> On 12/13/18 7:52 AM, Florian Haas wrote:
>> On 02/12/2018 19:48, Florian Haas wrote:
>>> Hi Mark,
>>>
>>> just taking the liberty to follow up on this one, as I'd really
>>> like to get to the bottom of this.
>>>
>>> On 28/11/2018 16:53, Florian Haas wrote:
>>>> On 28/11/2018 15:52, Mark Nelson wrote:
>>>>> Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
>>>>> Option::LEVEL_ADVANCED)
>>>>> .set_default(true)
>>>>> .set_flag(Option::FLAG_RUNTIME)
>>>>> .set_description("Cache read results by default (unless hinted
>>>>> NOCACHE or WONTNEED)"),
>>>>>
>>>>> Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
>>>>> Option::LEVEL_ADVANCED)
>>>>> .set_default(false)
>>>>> .set_flag(Option::FLAG_RUNTIME)
>>>>> .set_description("Cache writes by default (unless hinted
>>>>> NOCACHE or WONTNEED)"),
>>>>>
>>>>> This is one area where bluestore is a lot more confusing for
>>>>> users than filestore was. There was a lot of concern about
>>>>> enabling buffer cache on writes by default because there's some
>>>>> associated overhead (potentially both during writes and in the
>>>>> mempool thread when trimming the cache). It might be worth
>>>>> enabling bluestore_default_buffered_write and seeing if it helps
>>>>> reads.
>>>>
>>>> So yes, this is rather counterintuitive, but I happily gave it a
>>>> shot, and the results are... more head-scratching than before. :)
>>>>
>>>> The output is here: http://paste.openstack.org/show/736324/
>>>>
>>>> In summary:
>>>>
>>>> 1. Write benchmark is in the same ballpark as before (good).
>>>>
>>>> 2. Read benchmark *without* readahead is *way* better than before
>>>> (splendid!) but has a weird dip down to 9K IOPS that I find
>>>> inexplicable. Any ideas on that?
>>>>
>>>> 3. Read benchmark *with* readahead is still abysmal, which I also
>>>> find rather odd. What do you think about that one?
>>>
>>> These two still confuse me.
>>>
>>> In addition, I'm curious what you think of the approach of
>>> configuring OSDs with bluestore_cache_kv_ratio = .49, so that
>>> rather than using 1%/99%/0% of cache memory for metadata/KV
>>> data/objects, the OSDs use 1%/49%/50%. Is this sensible? I assume
>>> the default of not using any memory to actually cache object data
>>> is there for a reason, but I am struggling to grasp what that
>>> reason would be, particularly since with filestore we always got
>>> in-memory object caching for free, via the page cache.
>>
>> Hi Mark,
>>
>> do you mind if I give this another poke?
>
> Sorry, I got super busy with things and totally forgot about this.

No worries at all; that's why I follow up. :)

> Weird dips always make me think compaction. One thing we've seen is
> that compaction can force the entire cache to flush, invalidate all
> of the indexes/filters, and generally slow everything down. If you
> still have the OSD log you can run this tool to get compaction event
> stats (and restrict it to certain level compactions if you like):
>
> https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py

OK, suppose this confirms that compaction is indeed the culprit: what
would be the remedy then? I see there's the rather opaque catch-all
bluestore_rocksdb_options option, through which we could override
compaction_readahead_size, and then there are some BlueFS log
compaction settings, but those don't seem particularly applicable
here.
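(For concreteness, I assume I'd just point the script at the affected
OSD's log file, along these lines; I'm guessing at the invocation from
the script itself, and the log path is of course only an example:

    python ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log

If there's an option to restrict the output to particular compaction
levels, as you mention, I'd use that to narrow things down.)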
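If the remedy were, say, a larger compaction readahead, am I right
that the override would have to go through that catch-all, and that
setting bluestore_rocksdb_options replaces the *entire* default
string, so the other defaults need to be carried along? A sketch of
what I mean (the default option string below is what I believe ships
as of Luminous and may well differ between releases; only
compaction_readahead_size is changed, to a value picked purely for
illustration):

    [osd]
    bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=4194304

Or is there a saner way to do this that I'm missing?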
> No idea why readahead would be that much slower. We just saw a case
> where large sequential reads were incredibly slow with certain NVMe
> drives and LVM that were fixed by a kernel upgrade, but that was a
> very specific case.

Guess there's not that much to do here but leave it disabled, then?

> Regarding meta/kv/data ratios: It's really tough to configure
> optimal settings for all situations. Generally for RGW you need more
> KV cache and for RBD you need more meta cache, but it's highly
> variable (i.e. even in the RBD case you need enough KV cache to make
> sure all indexes/filters are cached, and in the RGW case you still
> may want to prioritize hot bluestore onodes). That's why I started
> writing the autotuning code. Because the cache is hierarchical, the
> worst-case situation is that you just end up caching the same onode
> data twice in both places (especially if you end up forcing out omap
> data you need cached). The best-case situation is that you cache the
> most useful recent data with as little double caching as possible.
> That's sort of the direction I'm trying to head with the autotuner.

You mentioned the KV cache and the meta cache, but I'm afraid that
doesn't quite address my question about a non-zero *data* cache (see
the P.S. below for the concrete settings I mean). Does setting a
non-zero data cache never make sense?

Also, given that this is turning into rather deep black magic in its
own right, what do you think about recommending that people keep
using filestore until a cache autotuner is actually available?

Cheers,
Florian
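P.S.: To make the ratio question concrete, this is roughly what I
have in mind (a sketch of my proposal, not a recommendation; it uses
the pre-autotuner ratio options, where the object data cache gets
whatever fraction is left over after the meta and KV shares):

    [osd]
    # Default split is 1% metadata / 99% KV / 0% object data
    # (bluestore_cache_meta_ratio = .01, bluestore_cache_kv_ratio = .99).
    # Lowering the KV share to 49% leaves 50% for object data:
    bluestore_cache_kv_ratio = .49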