On 5/12/20 1:57 PM, Blaine Gardner wrote:
(Please use reply-all so that people I've explicitly tagged and I can continue to get direct email replies)
I have a conundrum in developing best practices for Rook/Ceph clusters around OSD memory targets and hardware recommendations for OSDs. I want to lay out some bulleted notes first.
- 'osd_memory_target' defaults to 4GB
Correct
- OSDs attempt to keep memory allocation to 'osd_memory_target'
Specifically the OSD will attempt to keep the mapped memory below the
osd_memory_target but with potential, hopefully small, temporary
overages. The bigger a recent overage is the more aggressively the
priority cache manager will react when decreasing cache memory
allocation. All of this is in relation to mapped memory however, not
RSS memory.
- BUT... this is only best-effort
Correct
- AND... there is no guarantee that the kernel will actually reclaim memory that OSDs release/unmap
Correct, we can't explicitly force the kernel to reclaim memory.
- Therefore, we (SUSE) have developed a recommendation that ...
Total OSD node RAM required = (num OSDs) x (1 GB + osd_memory_target) + 16 GB
That's probably fairly reasonable in most cases though off at the
extremes (dedicating 21GB for a single OSD with a 4 GB memory target is
almost certainly overkill).
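To make the recommendation concrete, a quick worked example in Python (the node size here is purely illustrative):

    # Worked example of the recommendation above for a hypothetical 12-OSD node.
    num_osds = 12
    osd_memory_target_gb = 4        # Ceph default
    per_osd_overhead_gb = 1         # per-OSD overhead from the recommendation
    node_base_gb = 16               # base allowance for the OS and other daemons

    total_ram_gb = num_osds * (per_osd_overhead_gb + osd_memory_target_gb) + node_base_gb
    print(total_ram_gb)             # 12 * (1 + 4) + 16 = 76 GB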
- In Rook/Kubernetes, Ceph OSDs will read the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT env vars to infer a new default value for 'osd_memory_target'
- POD_MEMORY_REQUEST translates directly to 'osd_memory_target' 1:1
- POD_MEMORY_LIMIT (if REQUEST is unset) will set 'osd_memory_target' using the formula ( LIMIT x osd_memory_target_cgroup_limit_ratio )
- the resulting default 'osd_memory_target' will be min(REQUEST, LIMIT*ratio) (a rough sketch of this inference follows these notes)
- Lars has suggested that setting limits is not a best practice for Ceph; when limits are hit, Ceph is likely already in a failure state, and killing daemons could result in a "thundering herd" distributed-systems problem
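Here is the rough sketch of the env-var inference mentioned above. This is not the actual Ceph code, and the 0.8 default for osd_memory_target_cgroup_limit_ratio is my assumption:

    import os

    def inferred_osd_memory_target(default_target=4 * 2**30, ratio=0.8):
        # Rough sketch only; ratio stands in for
        # osd_memory_target_cgroup_limit_ratio (assumed to default to 0.8).
        request = int(os.environ.get("POD_MEMORY_REQUEST", "0"))
        limit = int(os.environ.get("POD_MEMORY_LIMIT", "0"))
        candidates = []
        if request > 0:
            candidates.append(request)              # REQUEST maps 1:1
        if limit > 0:
            candidates.append(int(limit * ratio))   # LIMIT scaled by the ratio
        return min(candidates) if candidates else default_target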
As you can see, there is a self-referential problem here. The OSD hardware recommendation should inform how we set k8s resource limits/requests for OSDs; however, doing so affects osd_memory_target, which alters the recommendation, which in turn alters our k8s resource limits, and so on in a circle forever.
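To see why it never settles, a toy iteration (purely illustrative): if the per-OSD request is always set to the recommended (osd_memory_target + 1GB) and Ceph then reads that request straight back in as the new target, the two numbers just chase each other upward:

    # Purely illustrative: REQUEST feeds back 1:1 into osd_memory_target, and
    # the recommendation then adds 1 GB of overhead back on top of the target.
    target_gb = 4.0
    for step in range(5):
        request_gb = target_gb + 1.0    # recommendation -> per-OSD k8s request
        target_gb = request_gb          # Ceph reads REQUEST back as the target
        print(step, request_gb)         # 5, 6, 7, 8, 9 GB ... grows without bound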
My take is that unless we are strictly allocating memory from
pre-allocated pools (which we are not) we can't make any guarantee that
a Ceph daemon is going to fit within a specific memory allocation at any
given point in time. The osd_memory_target code works surprisingly well
for what it is, but as discussed above it's not a guarantee (which is
precisely why it's called a target). That means we can either leave the
limits off, or set them to something high enough that we feel
relatively confident the OSD won't go OOM, but we'll still catch a
misbehaving daemon before it takes other containers (or the whole
container node) down.
We can address this issue with a semi-workaround currently:
set osd_memory_target explicitly in Ceph's config, and set a matching k8s resource request of (osd_memory_target + 1GB + some extra) to meet the hardware recommendation. However, this means that the Ceph feature of setting osd_memory_target from resource requests isn't really used, because it doesn't align with actual best practices. Setting a realistic k8s resource request is still useful, though, so that Kubernetes won't schedule more daemons onto a node than the node can realistically support.
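In numbers, the workaround looks something like this (the "some extra" padding is a made-up value for illustration):

    # Illustrative arithmetic for the workaround: pick osd_memory_target
    # explicitly, then size the k8s memory request to cover it plus overhead.
    GiB = 2**30
    osd_memory_target = 4 * GiB         # set explicitly in the Ceph config
    overhead = 1 * GiB                  # per-OSD overhead from the recommendation
    extra = 512 * 2**20                 # "some extra" safety margin (made up)

    k8s_memory_request = osd_memory_target + overhead + extra
    print(k8s_memory_request / GiB)     # 5.5 GiB per OSD pod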
There really are multiple ways an OSD can use more RSS memory than
expected. We can have temporary overages due to sudden allocations in
the OSD that we haven't yet compensated for (we try to keep some wiggle
room and periodically check memory usage to adjust the caches down, but
this doesn't happen instantaneously). You can have fragmentation below
the OSD where we might have freed memory that can't be reclaimed. There
may be a large memory allocation that we can't fully compensate for even
after shrinking caches to their minimums. This is especially true for
smaller than default osd_memory_targets. The more memory you give each
container over the target the less likely you are to hit a situation
where a spike causes an OOM kill. How much extra you need per container
is incredibly complicated and based on a number of factors including
fragmentation, transparent huge pages settings, PG count, incoming write
rate, rocksdb memtable sizes, rocksdb memtable flushing speed, rocksdb
compaction rate, osd_memory_target, bluestore_cache_autotune_chunk_size,
osd_memory_expected_fragmentation, osd_memory_cache_resize_interval, and
various other things. Potentially if there was some way for the OSD to
negotiate with the container it might be able to adjust that overage
amount on the fly (say start at 1GB and adjust over time based on
historical overages), but right now I don't think there's any capability
to do that.
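Just to make that idea concrete, a purely hypothetical sketch of what an on-the-fly overage adjustment could look like (nothing like this exists in Ceph or Rook today; the class and its policy are invented for illustration):

    GiB = 2**30

    class OverageEstimator:
        # Hypothetical: track how far mapped memory spikes above the target
        # and keep a per-OSD allowance that covers recent history.
        def __init__(self, initial=1 * GiB, decay=0.99):
            self.allowance = initial
            self.decay = decay

        def observe(self, mapped_bytes, osd_memory_target):
            overage = max(0, mapped_bytes - osd_memory_target)
            # Grow immediately to cover the worst recent spike, shrink slowly.
            self.allowance = max(overage, self.allowance * self.decay)
            return self.allowance

    # The container's memory request could then track
    # osd_memory_target + allowance, starting at target + 1 GB and adapting.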
Long-term, I wonder if it would be good to add into Ceph a computation that osd_memory_target = REQUEST - osd_memory_request_overhead, where osd_memory_request_overhead defaults to 1GB or somewhat higher.
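A minimal sketch of that proposed computation (osd_memory_request_overhead is a hypothetical option name that doesn't exist in Ceph today):

    GiB = 2**30

    def derived_osd_memory_target(pod_memory_request,
                                  osd_memory_request_overhead=1 * GiB):
        # Hypothetical: leave the overhead outside the cache target so the OSD
        # has headroom within its pod request.
        return max(pod_memory_request - osd_memory_request_overhead, 0)

    print(derived_osd_memory_target(5 * GiB) / GiB)   # 5 GiB request -> 4 GiB target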
Originally I think the idea was we were going to give the OSD some
percentage amount higher like 20% but we kept periodically exceeding it
with the default 4GB target. That also was when THP was enabled by
default and we were seeing large amounts of memory space amplification
due to fragmentation (and there may have been some other bugs regarding
how the limit was calculated). I'm not sure if there are currently any
container limits in place at all, but one of the Rook guys can probably
say what the current status is.
Please discuss, and let me know if anything here seems like I've gotten it wrong or if there are other options I haven't seen.
Cheers, and happy Tuesday!
Blaine
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx