On 13/12/2018 15:10, Mark Nelson wrote:
> Hi Florian,
>
> On 12/13/18 7:52 AM, Florian Haas wrote:
>> On 02/12/2018 19:48, Florian Haas wrote:
>>> Hi Mark,
>>>
>>> just taking the liberty to follow up on this one, as I'd really
>>> like to get to the bottom of this.
>>>
>>> On 28/11/2018 16:53, Florian Haas wrote:
>>>> On 28/11/2018 15:52, Mark Nelson wrote:
>>>>> Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
>>>>> Option::LEVEL_ADVANCED)
>>>>> .set_default(true)
>>>>> .set_flag(Option::FLAG_RUNTIME)
>>>>> .set_description("Cache read results by default (unless hinted
>>>>> NOCACHE or WONTNEED)"),
>>>>>
>>>>> Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
>>>>> Option::LEVEL_ADVANCED)
>>>>> .set_default(false)
>>>>> .set_flag(Option::FLAG_RUNTIME)
>>>>> .set_description("Cache writes by default (unless hinted
>>>>> NOCACHE or WONTNEED)"),
>>>>>
>>>>> This is one area where bluestore is a lot more confusing for
>>>>> users than filestore was. There was a lot of concern about
>>>>> enabling buffer cache on writes by default because there's some
>>>>> associated overhead (potentially both during writes and in the
>>>>> mempool thread when trimming the cache). It might be worth
>>>>> enabling bluestore_default_buffered_write and seeing if it helps
>>>>> reads.
>>>>
>>>> So yes, this is rather counterintuitive, but I happily gave it a
>>>> shot, and the results are... more head-scratching than before. :)
>>>>
>>>> The output is here: http://paste.openstack.org/show/736324/
>>>>
>>>> In summary:
>>>>
>>>> 1. Write benchmark is in the same ballpark as before (good).
>>>>
>>>> 2. Read benchmark *without* readahead is *way* better than before
>>>> (splendid!) but has a weird dip down to 9K IOPS that I find
>>>> inexplicable. Any ideas on that?
>>>>
>>>> 3. Read benchmark *with* readahead is still abysmal, which I also
>>>> find rather odd. What do you think about that one?
>>>
>>> These two still confuse me.
>>>
>>> In addition, I'm curious what you think of the approach of
>>> configuring OSDs with bluestore_cache_kv_ratio = .49, so that
>>> rather than using 1%/99%/0% of cache memory for metadata/KV
>>> data/objects, the OSDs use 1%/49%/50%. Is this sensible? I assume
>>> the default of not using any memory to actually cache object data
>>> is there for a reason, but I am struggling to grasp what that
>>> reason would be, particularly since with filestore we always got
>>> in-memory object caching for free, via the page cache.
>>
>> Hi Mark,
>>
>> do you mind if I give this another poke?
>
> Sorry, I got super busy with things and totally forgot about this.

No worries at all; that's why I follow up. :)

> Weird dips always make me think compaction. One thing we've seen is
> that compaction can force the entire cache to flush, invalidate all
> of the indexes/filters, and generally slow everything down. If you
> still have the OSD log you can run this tool to get compaction event
> stats (and restrict it to certain level compactions if you like):
>
> https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py

OK, suppose this confirms that compaction is indeed the culprit: what
would be the remedy then? I see there's the rather opaque catch-all
bluestore_rocksdb_options option, through which we could override
compaction_readahead_size, and then there are some BlueFS log
compaction settings, but those don't seem particularly applicable
here.
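(For concreteness, I assume I'd just point the script at the affected
OSD's log file, along these lines; I'm guessing at the invocation from
the script itself, and the log path is of course only an example:

    python ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log

If there's an option to restrict the output to particular compaction
levels, as you mention, I'd use that to narrow things down.)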
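If the remedy were, say, a larger compaction readahead, am I right
that the override would have to go through that catch-all, and that
setting bluestore_rocksdb_options replaces the *entire* default
string, so the other defaults need to be carried along? A sketch of
what I mean (the default option string below is what I believe ships
as of Luminous and may well differ between releases; only
compaction_readahead_size is changed, to a value picked purely for
illustration):

    [osd]
    bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=4194304

Or is there a saner way to do this that I'm missing?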
> No idea why readahead would be that much slower. We just saw a case
> where large sequential reads were incredibly slow with certain NVMe
> drives and LVM that were fixed by a kernel upgrade, but that was a
> very specific case.

Guess there's not that much to do here but leave it disabled, then?

> Regarding meta/kv/data ratios: It's really tough to configure
> optimal settings for all situations. Generally for RGW you need more
> KV cache and for RBD you need more meta cache, but it's highly
> variable (i.e. even in the RBD case you need enough KV cache to make
> sure all indexes/filters are cached, and in the RGW case you still
> may want to prioritize hot bluestore onodes). That's why I started
> writing the autotuning code. Because the cache is hierarchical, the
> worst-case situation is that you just end up caching the same onode
> data twice in both places (especially if you end up forcing out omap
> data you need cached). The best-case situation is that you cache the
> most useful recent data with as little double caching as possible.
> That's sort of the direction I'm trying to head with the autotuner.

You mentioned the KV cache and the meta cache, but I'm afraid that
doesn't quite address my question about a non-zero *data* cache (see
the P.S. below for the concrete settings I mean). Does setting a
non-zero data cache never make sense?

Also, given that this is turning into rather deep black magic in its
own right, what do you think about recommending that people keep
using filestore until a cache autotuner is actually available?

Cheers,
Florian
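P.S.: To make the ratio question concrete, this is roughly what I
have in mind (a sketch of my proposal, not a recommendation; it uses
the pre-autotuner ratio options, where the object data cache gets
whatever fraction is left over after the meta and KV shares):

    [osd]
    # Default split is 1% metadata / 99% KV / 0% object data
    # (bluestore_cache_meta_ratio = .01, bluestore_cache_kv_ratio = .99).
    # Lowering the KV share to 49% leaves 50% for object data:
    bluestore_cache_kv_ratio = .49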