Re: Slow rbd reads (fast writes) with luminous + bluestore

Hi Florian,

On 12/13/18 7:52 AM, Florian Haas wrote:
On 02/12/2018 19:48, Florian Haas wrote:
Hi Mark,

just taking the liberty to follow up on this one, as I'd really like to
get to the bottom of this.

On 28/11/2018 16:53, Florian Haas wrote:
On 28/11/2018 15:52, Mark Nelson wrote:
Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
     .set_default(true)
     .set_flag(Option::FLAG_RUNTIME)
     .set_description("Cache read results by default (unless hinted
NOCACHE or WONTNEED)"),

     Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
     .set_default(false)
     .set_flag(Option::FLAG_RUNTIME)
     .set_description("Cache writes by default (unless hinted NOCACHE or
WONTNEED)"),


This is one area where bluestore is a lot more confusing for users than
filestore was.  There was a lot of concern about enabling buffer cache
on writes by default because there's some associated overhead
(potentially both during writes and in the mempool thread when trimming
the cache).  It might be worth enabling bluestore_default_buffered_write
and seeing if it helps reads.
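
For reference, both options carry FLAG_RUNTIME, so this should be switchable on a running OSD.  A minimal sketch, assuming the usual injectargs route on Luminous (the ceph.conf entry is only there to make the change persist across restarts):

    # apply to all running OSDs without a restart
    ceph tell osd.* injectargs '--bluestore_default_buffered_write=true'

    # and/or persist it in ceph.conf
    [osd]
    bluestore_default_buffered_write = true
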
So yes, this is rather counterintuitive, but I happily gave it a shot and
the results are... more head-scratching than before. :)

The output is here: http://paste.openstack.org/show/736324/

In summary:

1. Write benchmark is in the same ballpark as before (good).

2. Read benchmark *without* readahead is *way* better than before
(splendid!) but has a weird dip down to 9K IOPS that I find
inexplicable. Any ideas on that?

3. Read benchmark *with* readahead is still abysmal, which I also find
rather odd. What do you think about that one?
These two still confuse me.

In addition, I'm curious what you think of the approach of configuring
OSDs with bluestore_cache_kv_ratio = .49, so that rather than using
1%/99%/0% of cache memory for metadata/KV data/objects, the OSDs use
1%/49%/50%. Is this sensible? I assume the default of not using any
memory to actually cache object data is there for a reason, but I'm
struggling to grasp what that reason would be, particularly since with
filestore we always got in-memory object caching for free, via the page
cache.
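
Spelled out as a config sketch, with a purely illustrative 3 GiB cache size (the real value is whatever bluestore_cache_size resolves to on the OSD), the split I have in mind looks like this:

    [osd]
    # metadata stays at its 1% default; the ~50% remainder goes to object data
    bluestore_cache_kv_ratio = 0.49
    # with a 3 GiB cache that works out to roughly 30 MiB metadata,
    # ~1.47 GiB KV data and ~1.5 GiB object data
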
Hi Mark,

do you mind if I give this another poke?


Sorry, I got super busy with things and totally forgot about this.  Weird dips always make me think compaction.  One thing we've seen is that compaction can force the entire cache to flush, invalidate all of the indexes/filters, and generally slow everything down.  If you still have the OSD log, you can run this tool to get compaction event stats (and restrict it to certain level compactions if you like):


https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
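
Something along these lines should work; I'm assuming the script just takes the OSD log path as its argument, so double-check the invocation against the script itself (it also needs the RocksDB output to actually be present in that log):

    # parse compaction events out of an OSD log (path is just an example)
    python ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log
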


No idea why readahead would be that much slower.  We just saw a case where large sequential reads were incredibly slow with certain NVMe drives and LVM, which was fixed by a kernel upgrade, but that was a very specific case.

Regarding meta/kv/data ratios: it's really tough to configure optimal settings for all situations.  Generally for RGW you need more KV cache and for RBD you need more meta cache, but it's highly variable (i.e. even in the RBD case you need enough KV cache to make sure all indexes/filters are cached, and in the RGW case you still may want to prioritize hot bluestore onodes).  That's why I started writing the autotuning code.  Because the cache is hierarchical, the worst-case situation is that you just end up caching the same onode data twice in both places (especially if you end up forcing out omap data you need cached).  The best-case situation is that you cache the most useful recent data with as little double caching as possible.  That's sort of the direction I'm trying to head with the autotuner.
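
If you want to see where the cache memory is actually going while you experiment, the OSD admin socket can dump the mempool stats; the bluestore cache pools show up there (pool names are from memory, so treat them as approximate):

    # on the OSD host; byte/item counts per mempool, including
    # bluestore_cache_data / bluestore_cache_onode / bluestore_cache_other
    ceph daemon osd.0 dump_mempools
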



Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



