On 5/12/20 1:57 PM, Blaine Gardner wrote:
(Please use reply-all so that people I've explicitly tagged and I can continue to get direct email replies)
I have a conundrum in developing best practices for Rook/Ceph clusters around OSD memory targets and hardware recommendations for OSDs. I want to lay out some bulleted notes first.
- 'osd_memory_target' defaults to 4GB
Correct
- OSDs attempt to keep memory allocation to 'osd_memory_target'
Specifically the OSD will attempt to keep the mapped memory below the
osd_memory_target but with potential, hopefully small, temporary
overages. The bigger a recent overage is the more aggressively the
priority cache manager will react when decreasing cache memory
allocation. All of this is in relation to mapped memory however, not
RSS memory.
- BUT... this is only best-effort
Correct
- AND... there is no guarantee that the kernel will actually reclaim memory that OSDs release/unmap
Correct, we can't explicitly force the kernel to reclaim memory.
- Therefore, we (SUSE) have developed a recommendation that ...
Total OSD node RAM required = (num OSDs) x (1 GB + osd_memory_target) + 16 GB
That's probably fairly reasonable in most cases though off at the
extremes (dedicating 21GB for a single OSD with a 4 GB memory target is
almost certainly overkill).
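To make the recommendation concrete, a quick worked example in Python (the node size here is purely illustrative):

    # Worked example of the recommendation above for a hypothetical 12-OSD node.
    num_osds = 12
    osd_memory_target_gb = 4        # Ceph default
    per_osd_overhead_gb = 1         # per-OSD overhead from the recommendation
    node_base_gb = 16               # base allowance for the OS and other daemons

    total_ram_gb = num_osds * (per_osd_overhead_gb + osd_memory_target_gb) + node_base_gb
    print(total_ram_gb)             # 12 * (1 + 4) + 16 = 76 GB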
- In Rook/Kubernetes, Ceph OSDs will read the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT env vars to infer a new default value for 'osd_memory_target'
- POD_MEMORY_REQUEST translates directly to 'osd_memory_target' 1:1
- POD_MEMORY_LIMIT (if REQUEST is unset) will set 'osd_memory_target' using the formula ( LIMIT x osd_memory_target_cgroup_limit_ratio )
- the resulting default 'osd_memory_target' will be min(REQUEST, LIMIT*ratio) (a rough sketch of this inference follows these notes)
- Lars has suggested that setting limits is not a best practice for Ceph; when limits are hit, Ceph is likely already in a failure state, and killing daemons could result in a "thundering herd" distributed-systems problem
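Here is the rough sketch of the env-var inference mentioned above. This is not the actual Ceph code, and the 0.8 default for osd_memory_target_cgroup_limit_ratio is my assumption:

    import os

    def inferred_osd_memory_target(default_target=4 * 2**30, ratio=0.8):
        # Rough sketch only; ratio stands in for
        # osd_memory_target_cgroup_limit_ratio (assumed to default to 0.8).
        request = int(os.environ.get("POD_MEMORY_REQUEST", "0"))
        limit = int(os.environ.get("POD_MEMORY_LIMIT", "0"))
        candidates = []
        if request > 0:
            candidates.append(request)              # REQUEST maps 1:1
        if limit > 0:
            candidates.append(int(limit * ratio))   # LIMIT scaled by the ratio
        return min(candidates) if candidates else default_target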
As you can see, there is a self-referential problem here. The OSD hardware recommendation should inform how we set k8s resource limits/requests for OSDs; however, doing so affects osd_memory_target, which alters the recommendation, which in turn alters our k8s resource limits, and so on in a circle forever.
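To see why it never settles, a toy iteration (purely illustrative): if the per-OSD request is always set to the recommended (osd_memory_target + 1GB) and Ceph then reads that request straight back in as the new target, the two numbers just chase each other upward:

    # Purely illustrative: REQUEST feeds back 1:1 into osd_memory_target, and
    # the recommendation then adds 1 GB of overhead back on top of the target.
    target_gb = 4.0
    for step in range(5):
        request_gb = target_gb + 1.0    # recommendation -> per-OSD k8s request
        target_gb = request_gb          # Ceph reads REQUEST back as the target
        print(step, request_gb)         # 5, 6, 7, 8, 9 GB ... grows without bound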
My take is that unless we are strictly allocating memory from
pre-allocated pools (which we are not) we can't make any guarantee that
a Ceph daemon is going to fit within a specific memory allocation at any
given point in time. The osd_memory_target code works surprisingly well
for what it is, but as discussed above it's not a guarantee (which is
precisely why it's called a target). That means we can either leave the
limits off, or set them to something high enough that we feel
relatively confident the OSD won't go OOM, but we'll still catch a
misbehaving daemon before it takes other containers (or the whole
container node) down.
We can address this issue with a semi-workaround currently:
set osd_memory_target explicitly in Ceph's config, and set a matching k8s resource request of (osd_memory_target + 1GB + some extra) to meet the hardware recommendation. However, this means that the Ceph feature of setting osd_memory_target from resource requests isn't really used, because it doesn't align with actual best practices. Setting a realistic k8s resource request is still useful, though, so that Kubernetes won't schedule more daemons onto a node than the node can realistically support.
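In numbers, the workaround looks something like this (the "some extra" padding is a made-up value for illustration):

    # Illustrative arithmetic for the workaround: pick osd_memory_target
    # explicitly, then size the k8s memory request to cover it plus overhead.
    GiB = 2**30
    osd_memory_target = 4 * GiB         # set explicitly in the Ceph config
    overhead = 1 * GiB                  # per-OSD overhead from the recommendation
    extra = 512 * 2**20                 # "some extra" safety margin (made up)

    k8s_memory_request = osd_memory_target + overhead + extra
    print(k8s_memory_request / GiB)     # 5.5 GiB per OSD pod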
There really are multiple ways an OSD can use more RSS memory than
expected. We can have temporary overages due to sudden allocations in
the OSD that we haven't yet compensated for (we try to keep some wiggle
room and periodically check memory usage to adjust the caches down, but
this doesn't happen instantaneously). You can have fragmentation below
the OSD where we might have freed memory that can't be reclaimed. There
may be a large memory allocation that we can't fully compensate for even
after shrinking caches to their minimums. This is especially true for
smaller than default osd_memory_targets. The more memory you give each
container over the target the less likely you are to hit a situation
where a spike causes an OOM kill. How much extra you need per container
is incredibly complicated and based on a number of factors including
fragmentation, transparent huge pages settings, PG count, incoming write
rate, rocksdb memtable sizes, rocksdb memtable flushing speed, rocksdb
compaction rate, osd_memory_target, bluestore_cache_autotune_chunk_size,
osd_memory_expected_fragmentation, osd_memory_cache_resize_interval, and
various other things. Potentially if there was some way for the OSD to
negotiate with the container it might be able to adjust that overage
amount on the fly (say start at 1GB and adjust over time based on
historical overages), but right now I don't think there's any capability
to do that.
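Just to make that idea concrete, a purely hypothetical sketch of what an on-the-fly overage adjustment could look like (nothing like this exists in Ceph or Rook today; the class and its policy are invented for illustration):

    GiB = 2**30

    class OverageEstimator:
        # Hypothetical: track how far mapped memory spikes above the target
        # and keep a per-OSD allowance that covers recent history.
        def __init__(self, initial=1 * GiB, decay=0.99):
            self.allowance = initial
            self.decay = decay

        def observe(self, mapped_bytes, osd_memory_target):
            overage = max(0, mapped_bytes - osd_memory_target)
            # Grow immediately to cover the worst recent spike, shrink slowly.
            self.allowance = max(overage, self.allowance * self.decay)
            return self.allowance

    # The container's memory request could then track
    # osd_memory_target + allowance, starting at target + 1 GB and adapting.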
Long-term, I wonder if it would be good to add into Ceph a computation that osd_memory_target = REQUEST - osd_memory_request_overhead, where osd_memory_request_overhead defaults to 1GB or somewhat higher.
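A minimal sketch of that proposed computation (osd_memory_request_overhead is a hypothetical option name that doesn't exist in Ceph today):

    GiB = 2**30

    def derived_osd_memory_target(pod_memory_request,
                                  osd_memory_request_overhead=1 * GiB):
        # Hypothetical: leave the overhead outside the cache target so the OSD
        # has headroom within its pod request.
        return max(pod_memory_request - osd_memory_request_overhead, 0)

    print(derived_osd_memory_target(5 * GiB) / GiB)   # 5 GiB request -> 4 GiB target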
Originally I think the idea was we were going to give the OSD some
percentage amount higher like 20% but we kept periodically exceeding it
with the default 4GB target. That also was when THP was enabled by
default and we were seeing large amounts of memory space amplification
due to fragmentation (and there may have been some other bugs regarding
how the limit was calculated). I'm not sure if there are currently any
container limits in place at all, but one of the Rook guys can probably
say what the current status is.
Please discuss, and let me know if anything here seems like I've gotten it wrong or if there are other options I haven't seen.
Cheers, and happy Tuesday!
Blaine
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx