On 3/29/22 11:44, Anthony D'Atri wrote:
>> [osd]
>> bluestore_cache_autotune = 0
>
> Why are you turning autotuning off?
>
> FWIW I’ve encountered the below assertions. I neither support nor deny them, pasting here for discussion. One might interpret this to only apply to OSDs with DB on a separate (faster) device.
>
> With random small block workloads, it’s important to keep BlueStore metadata cached and keep RocksDB from spilling over to slow media – including during compaction. If there is adequate memory on the OSD node, it is recommended to increase the BlueStore metadata cache ratio. An example of this is shown below:
>
> bluestore_cache_meta_ratio = 0.8
> bluestore_cache_kv_ratio = 0.2
> bluestore_cache_size_ssd = 6GB
>
> In Ceph Nautilus and above, the cache ratios are automatically tuned, so it is recommended to first observe the relevant cache hit counters in BlueStore before manually setting these parameters. There is some disagreement regarding how effective the autotuning is.
>
> https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/ suggests that we still set
> bluestore_cache_size_ssd = 8GB with 12GB memory target.
Sorry, this is going to be a long one... :)
The basic gist of it is that if you disable autotuning, the OSD will use
a set "cache size" for various caches and then divvy the memory up
between them based on the defined ratios. For bluestore that means the
rocksdb block cache(s), bluestore onode cache, and bluestore buffer
cache. IE in the above example that's 6GB with 80% going to onode
"meta" cache, 20% going to the rocksdb block "kv" cache, and an implicit
0% being dedicated to bluestore buffer cache. This kind of setup tends
to work best when you have a well defined workload and you know exactly
how you want to tune the different cache sizes for optimal performance
(often times giving a lot of the memory to onode cache for RBD for
example). The amount of memory the OSD uses can float up and down and
it tends to be a little easier on tcmalloc because you aren't
growing/shrinking the caches constantly trying to stay within a certain
memory target.
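As a concrete sketch (byte values purely illustrative, and bluestore_cache_size_hdd is the equivalent knob for HDD-backed OSDs), that kind of fixed-size setup looks something like:

[osd]
bluestore_cache_autotune = 0
# fixed 6 GiB cache for SSD-backed OSDs
bluestore_cache_size_ssd = 6442450944
# 80% of that goes to the bluestore onode ("meta") cache
bluestore_cache_meta_ratio = 0.8
# 20% goes to the rocksdb block ("kv") cache, leaving an implicit 0%
# for the bluestore buffer cache
bluestore_cache_kv_ratio = 0.2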
When cache autotuning is enabled, the cache size is allowed to fluctuate
based on the osd_memory_target and how much memory is mapped by the
ceph-osd process as reported by tcmalloc. This is almost like using RSS
memory as the target but not quite. The difference is that there is no
guarantee that the kernel will reclaim freed memory soon (or at all), so
RSS memory usage ends up being a really poor metric for trying to
dynamically adjust memory targets (I tried with fairly comical
results). This process of adjusting the caches based on a process level
memory target seems to be harder on tcmalloc, probably because we're
freeing a bunch of fragmented memory (say from the onode cache) while
it's simultaneously trying to hand sequential chunks of memory out to
something else (whatever is requesting memory and forcing us to go over
target). We tend to oscillate around the memory target, though overall
the system works fairly well if you are willing to accept up to ~20%
memory spikes under heavy (write) workloads. You can tweak the behavior
to more aggressively try to control this by increasing the frequency
that we recalculate the memory target, but it's more CPU intensive and
may overcompensate by releasing too much fragmented memory too quickly.
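For comparison, the autotuned path is driven by settings along these lines (again just a sketch with illustrative values; double check the option names against your release's docs, especially the resize interval):

[osd]
bluestore_cache_autotune = 1
# per-process memory target the autotuner aims for (12 GiB here)
osd_memory_target = 12884901888
# floor below which the autotuner won't shrink the combined caches
osd_memory_cache_min = 2147483648
# how often (in seconds) the caches are rebalanced against the target;
# smaller values react faster but burn more CPU and can overcompensate
osd_memory_cache_resize_interval = 1

You can also watch tcmalloc's view of a live OSD with something like "ceph tell osd.0 heap stats" (osd.0 being whichever OSD you care about).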
Enabling autotuning also enables the priority cache manager. Each cache
subsystem will request memory at different priority targets (say pri0,
pri1, etc). When autotuning is enabled the ratios no longer govern a
global percentage of the cache, but instead govern a "fairshare" target
at each priority level. Each cache is assigned at least its ratio of
the available memory at a given level. If a cache is assigned all of
the memory it requests at that level, the priority cache manager will
use leftover memory to fulfill requests at that level by caches that
want more memory than their fairshare target. This process continues
until all requests at a given level have been fulfilled or we run out of
memory available for caches. If all requests have been fulfilled at a
given level, we move to the next level and start the process all over again.
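To make that concrete with made-up numbers: say 6GB is available at a given level and the meta/kv ratios are 0.8/0.2, so the fairshare targets are 4.8GB and 1.2GB. If the onode cache only requests 3GB at that level it is fully satisfied, the kv cache gets at least its 1.2GB fairshare, and the remaining 1.8GB can then go toward whatever the kv cache requested beyond its fairshare before we move on to the next level.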
In current versions of ceph we only really utilize 2 of the available
levels. Priority0 is used for very high priority things (like items
pinned in the cache or rocksdb "hipri pool" items). Everything else is
basically shoved into a single level and competes there. In Quincy, we
finally implemented age-binning, where we associate items in the
different caches with "age bins" that give us a coarse look at the
relative ages of all cache items. IE say that there are old onode
entries sitting in the bluestore onode cache, but now there is a really
hot read workload against a single large object. That OSD's priority
cache can now sort those older onode entries into a lower priority level
than the buffer cache data for the hot object. We may generally favor
onodes heavily at a given priority level, but in this case older onodes
may end up in a lower priority level than the hot buffer data, so the
buffer data memory request is fulfilled first.
Due to various factors this isn't as big of a win as I had hoped it
would be (primarily in relation to the rocksdb block cache, since
compaction tends to blow everything in the cache away regularly
anyway). In reality the biggest benefit seems to be that we are more
aggressive about clearing away very old onode data if there are new
writes, which we suspect is reducing memory fragmentation, and it's much
easier to tell the ages of items in the various caches via the perf
admin socket. It does give us significantly more control and insight
into the global cache behavior though, so in general it seems to be a
good thing. The perf tests we ran ranged from having little effect to
showing moderate improvement in some scenarios.
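If you want to poke at this on a live OSD, the relevant state is exposed through the admin socket, e.g. (osd.0 is just a placeholder, and the exact counter names shift a bit between releases):

# priority/age information shows up under the prioritycache counters
ceph daemon osd.0 perf dump
# per-mempool item counts and byte totals, including the bluestore caches
ceph daemon osd.0 dump_mempools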
FWIW, despite the fact that I wrote the prioritycache system and memory
autotuning code, I'd be much happier if we were much less dynamic about
how we allocate memory. That probably goes all the way back to how the
message over the wire looks. Ideally we would have a very short path
from the message to the disk with minimal intermediate translation of
the message, minimal dynamic behavior based on the content of the
message, and recycling static buffers or objects from a contiguous pool
whenever possible. The prioritycache system tries to account for
dynamic memory allocations in ceph by reactively growing/shrinking the
caches, but it would be much better if we didn't need to do any of that
in the first place.
Mark