Hi Florian,
On 12/13/18 7:52 AM, Florian Haas wrote:
On 02/12/2018 19:48, Florian Haas wrote:
Hi Mark,
just taking the liberty to follow up on this one, as I'd really like to
get to the bottom of this.
On 28/11/2018 16:53, Florian Haas wrote:
On 28/11/2018 15:52, Mark Nelson wrote:
Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
.set_default(true)
.set_flag(Option::FLAG_RUNTIME)
.set_description("Cache read results by default (unless hinted
NOCACHE or WONTNEED)"),
Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
.set_default(false)
.set_flag(Option::FLAG_RUNTIME)
.set_description("Cache writes by default (unless hinted NOCACHE or
WONTNEED)"),
This is one area where bluestore is a lot more confusing for users than
filestore was. There was a lot of concern about enabling the buffer cache
on writes by default because there's some associated overhead
(potentially both during writes and in the mempool thread when trimming
the cache). It might be worth enabling bluestore_default_buffered_write
and seeing if it helps reads.
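(For reference, a minimal sketch of how to flip that option, either
persistently in ceph.conf or at runtime, since it carries FLAG_RUNTIME;
adjust the OSD target spec to your own deployment:

  # ceph.conf, persistent across restarts
  [osd]
  bluestore_default_buffered_write = true

  # or at runtime on all OSDs
  ceph tell osd.* injectargs '--bluestore_default_buffered_write=true'
)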
So yes this is rather counterintuitive, but I happily gave it a shot and
the results are... more head-scratching than before. :)
The output is here: http://paste.openstack.org/show/736324/
In summary:
1. Write benchmark is in the same ballpark as before (good).
2. Read benchmark *without* readahead is *way* better than before
(splendid!) but has a weird dip down to 9K IOPS that I find
inexplicable. Any ideas on that?
3. Read benchmark *with* readahead is still abysmal, which I also find
rather odd. What do you think about that one?
These two still confuse me.
And in addition, I'm curious what you think of the approach of
configuring OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% of cache memory for metadata/KV data/objects, the
OSDs use 1%/49%/50%. Is this sensible? I assume the default of not using
any memory to actually cache object data is there for a reason, but I am
struggling to grasp what that reason would be. Particularly since in
filestore, we always got in-memory object caching for free, via the page
cache.
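(For reference, the configuration in question would look roughly like
this; a sketch assuming the static pre-autotune ratio options, with
bluestore_cache_meta_ratio left at its ~1% default:

  [osd]
  # default split is roughly 1% meta / 99% KV / 0% object data
  bluestore_cache_kv_ratio = 0.49
  # the remaining ~50% of the cache is then available for object data
)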
Hi Mark,
do you mind if I give this another poke?
Sorry, I got super busy with things and totally forgot about this.
Weird dips always make me think compaction. One thing we've seen is
that compaction can force the entire cache to flush, invalidate all of
the indexes/filters, and generally slow everything down. If you still
have the OSD log, you can run this tool to get compaction event stats
(and restrict it to certain level compactions if you like):
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
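(Rough usage, assuming the script simply takes an OSD log path; check
the script's source or --help for its exact arguments and filters:

  python ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log
)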
No idea why readahead would be that much slower. We did just see a case
where large sequential reads were incredibly slow with certain NVMe
drives and LVM, and a kernel upgrade fixed that, but it was a very
specific case.
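(If the readahead in question is kernel block-layer readahead on the
OSD devices, a quick sanity check is possible; sdX below is a
placeholder for the actual device:

  # current readahead in KiB
  cat /sys/block/sdX/queue/read_ahead_kb
  # or in 512-byte sectors via blockdev
  blockdev --getra /dev/sdX
)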
Regarding meta/kv/data ratios: It's really tough to configure optimal
settings for all situations. Generally for RGW you need more KV cache
and for RBD you need more meta cache, but it's highly variable (i.e.
even in the RBD case you need enough KV cache to make sure all
indexes/filters are cached, and in the RGW case you may still want to
prioritize hot bluestore onodes). That's why I started writing the
autotuning code. Because the cache is hierarchical, the worst case
situation is that you just end up caching the same onode data twice in
both places (especially if you end up forcing out omap data you need
cached). The best case situation is that you cache the most useful
recent data with as little double caching as possible. That's sort of
the direction I'm trying to head with the autotuner.
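(For anyone wanting to try the autotuner on a release that includes it,
the relevant knobs look roughly like the following; exact availability
and defaults depend on your Ceph version, so treat this as a sketch:

  [osd]
  # size the bluestore caches automatically against a memory budget
  bluestore_cache_autotune = true
  # overall memory the OSD tries to stay under, e.g. ~4 GiB
  osd_memory_target = 4294967296
)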
Cheers,
Florian