Re: osd_memory_target in Rook, set automatically via k8s Resource Requests, and hardware recommendations

On 5/12/20 1:57 PM, Blaine Gardner wrote:

(Please use reply-all so that people I've explicitly tagged and I can continue to get direct email replies)

I have a conundrum in developing best practices for Rook/Ceph clusters around OSD memory targets and hardware recommendations for OSDs. I want to lay down some bulleted notes first.
- 'osd_memory_target' defaults to 4GB

Correct

- OSDs attempt to keep memory allocation to 'osd_memory_target'
Specifically, the OSD will attempt to keep mapped memory below osd_memory_target, with potential (hopefully small) temporary overages. The bigger a recent overage is, the more aggressively the priority cache manager will react when decreasing cache memory allocation. All of this is in relation to mapped memory, however, not RSS memory.
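
As a rough illustration of that reaction (this is not Ceph's actual priority cache code, and all names here are stand-ins), the idea is roughly:

    # Illustrative sketch only: shrink cache allocation when mapped memory
    # exceeds osd_memory_target, more for bigger overages. Mapped memory,
    # not RSS, is what gets compared against the target.
    def adjust_cache(cache_bytes, mapped_bytes, osd_memory_target,
                     min_cache_bytes=128 * 1024 * 1024):
        overage = mapped_bytes - osd_memory_target
        if overage <= 0:
            return cache_bytes          # under target: leave the cache alone
        # Cut the cache by the full overage (a simple proportional response),
        # but never below a minimum cache size.
        return max(min_cache_bytes, cache_bytes - overage)
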
- BUT... this is only best-effort

Correct

- AND... there is no guarantee that the kernel will actually reclaim memory that OSDs release/unmap
Correct, we can't explicitly force the kernel to reclaim memory.
- Therefore, we (SUSE) have developed a recommendation that ...
    Total OSD node RAM required = (num OSDs) x (1 GB + osd_memory_target) + 16 GB

That's probably fairly reasonable in most cases though off at the extremes (dedicating 21GB for a single OSD with a 4 GB memory target is almost certainly overkill).
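
For concreteness, a quick sketch of that sizing formula (the function name and example values are mine):

    # Total OSD node RAM = (num OSDs) x (1 GB + osd_memory_target) + 16 GB
    def osd_node_ram_gb(num_osds, osd_memory_target_gb=4,
                        per_osd_overhead_gb=1, node_base_gb=16):
        return num_osds * (per_osd_overhead_gb + osd_memory_target_gb) + node_base_gb

    print(osd_node_ram_gb(1))    # 21 GB: the single-OSD extreme noted above
    print(osd_node_ram_gb(12))   # 76 GB for a 12-OSD node with 4 GB targets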

- In Rook/Kubernetes, Ceph OSDs will read the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT env vars to infer a new default value for 'osd_memory_target'
    - POD_MEMORY_REQUEST translates directly to 'osd_memory_target' 1:1
    - POD_MEMORY_LIMIT (if REQUEST is unset) will set 'osd_memory_target' using the formula ( LIMIT x osd_memory_target_cgroup_limit_ratio )
    - the resulting default 'osd_memory_target' will be min(REQUEST, LIMIT * ratio); see the sketch just after this list
- Lars has suggested that setting limits is not a best practice for Ceph; when limits are hit, Ceph is likely already in a failure state, and killing daemons could result in a "thundering herd" distributed-systems problem
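
To make the REQUEST/LIMIT inference above concrete, here is a minimal Python sketch, assuming REQUEST maps 1:1, LIMIT contributes LIMIT * osd_memory_target_cgroup_limit_ratio, and the smaller candidate wins when both are present (the helper name is mine, and 0.8 is used only as an example ratio):

    import os

    def inferred_osd_memory_target(default_target=4 * 1024**3,
                                   cgroup_limit_ratio=0.8):
        # Sketch only: POD_MEMORY_REQUEST/POD_MEMORY_LIMIT are byte counts
        # exposed by the k8s downward API as environment variables.
        request = os.environ.get("POD_MEMORY_REQUEST")
        limit = os.environ.get("POD_MEMORY_LIMIT")
        candidates = []
        if request:
            candidates.append(int(request))
        if limit:
            candidates.append(int(int(limit) * cgroup_limit_ratio))
        return min(candidates) if candidates else default_target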

As you can see, there is a self-referential problem here. The OSD hardware recommendation should inform how we set k8s resource limits for OSDs; however, doing so affects osd_memory_target, which alters the recommendation, which in turn alters our k8s resource limits, and so on forever.

My take is that unless we are strictly allocating memory from pre-allocated pools (which we are not), we can't guarantee that a Ceph daemon will fit within a specific memory allocation at any given point in time.  The osd_memory_target code works surprisingly well for what it is, but as discussed above it's not a guarantee (which is precisely why it's called a target).  That means we can either leave the limits off, or set them high enough that we feel relatively confident the OSD won't go OOM but we'll still catch a misbehaving daemon before it takes down other containers (or the whole node).


We can address this issue with a semi-workaround currently:
set osd_memory_target explicitly in Ceph's config, and set an appropriate k8s resource request matching (osd_memory_target + 1GB + some extra) to meet the hardware recommendation. However, this means the Ceph feature of setting osd_memory_target based on resource requests isn't really used, because it doesn't match actual best practices. Setting a realistic k8s resource request is still useful, since Kubernetes then won't schedule more daemons onto a node than the node can realistically support.
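
Sketched out, the workaround amounts to something like this (the "extra" headroom value is a placeholder, not a tested number):

    GiB = 1024**3

    # Pick osd_memory_target explicitly in Ceph's config, then derive the k8s
    # memory request from it, so the two never feed back into each other.
    def k8s_memory_request(osd_memory_target=4 * GiB,
                           per_osd_overhead=1 * GiB,
                           extra_headroom=1 * GiB):   # "some extra": placeholder
        return osd_memory_target + per_osd_overhead + extra_headroom

    print(k8s_memory_request() / GiB)   # 6.0 GiB requested per OSD pod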


There really are multiple ways an OSD can use more RSS memory than expected.  We can have temporary overages due to sudden allocations in the OSD that we haven't yet compensated for (we try to keep some wiggle room and periodically check memory usage to adjust the caches down, but this doesn't happen instantaneously).  We can have fragmentation below the OSD where we might have freed memory that can't be reclaimed.  There may be a large memory allocation that we can't fully compensate for even after shrinking caches to their minimums.  This is especially true for smaller-than-default osd_memory_targets.

The more memory you give each container over the target, the less likely you are to hit a situation where a spike causes an OOM kill.  How much extra you need per container is incredibly complicated and depends on a number of factors including fragmentation, transparent huge pages settings, PG count, incoming write rate, rocksdb memtable sizes, rocksdb memtable flushing speed, rocksdb compaction rate, osd_memory_target, bluestore_cache_autotune_chunk_size, osd_memory_expected_fragmentation, osd_memory_cache_resize_interval, and various other things.  Potentially, if there were some way for the OSD to negotiate with the container, it might be able to adjust that overage amount on the fly (say, start at 1GB and adjust over time based on historical overages), but right now I don't think there's any capability to do that.
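
The "adjust the overage on the fly" idea might look roughly like this, purely as a sketch (no such OSD/container negotiation exists today, and all names here are invented):

    GiB = 1024**3

    class HeadroomEstimator:
        """Speculative sketch: start at 1 GiB of headroom over the target and
        grow it based on historical overages."""

        def __init__(self, initial_headroom=1 * GiB):
            self.initial_headroom = initial_headroom
            self.peak_overage = 0

        def observe(self, mapped_bytes, osd_memory_target):
            # Remember the worst overage seen so far.
            self.peak_overage = max(self.peak_overage,
                                    mapped_bytes - osd_memory_target)

        def suggested_headroom(self):
            # Grow toward the historical peak overage plus a 20% safety
            # margin; never shrink below the starting 1 GiB.
            return max(self.initial_headroom, int(self.peak_overage * 1.2))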



Long-term, I wonder if it would be good to add into Ceph a computation that [[ osd_memory_target = REQUEST - osd_memory_request_overhead ]], where osd_memory_request_overhead defaults to 1GB or somewhat higher.
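
As a sketch of how that would break the circularity (osd_memory_request_overhead is hypothetical and does not exist in Ceph today):

    GiB = 1024**3

    def target_from_request(pod_memory_request,
                            osd_memory_request_overhead=1 * GiB):
        # Hypothetical option: derive the target from the pod's request minus
        # a fixed per-OSD overhead.
        return pod_memory_request - osd_memory_request_overhead

    # The hardware recommendation turns a target into a per-OSD request
    # (target + 1 GiB), and this computation recovers the same target from
    # that request, so the two quantities no longer chase each other.
    per_osd_request = 4 * GiB + 1 * GiB
    assert target_from_request(per_osd_request) == 4 * GiB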

Originally, I think the idea was that we were going to give the OSD some percentage amount higher, like 20%, but we kept periodically exceeding it with the default 4GB target.  That also was when THP was enabled by default and we were seeing large amounts of memory space amplification due to fragmentation (and there may have been some other bugs regarding how the limit was calculated).  I'm not sure if there are currently any container limits in place at all, but one of the Rook guys can probably say what the current status is.


Please discuss, and let me know if it seems like I've gotten anything wrong here, or if there are other options I haven't seen.

Cheers, and happy Tuesday!
Blaine
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
