(Please use reply-all so that people I've explicitly tagged and I can continue to get direct email replies)

I have a conundrum around developing best practices for Rook/Ceph clusters: OSD memory targets versus hardware recommendations for OSDs. I want to lay down some bulleted notes first.

- 'osd_memory_target' defaults to 4 GB
- OSDs attempt to keep memory allocation to 'osd_memory_target'
  - BUT... this is only best-effort
  - AND... there is no guarantee that the kernel will actually reclaim memory that OSDs release/unmap
- Therefore, we (SUSE) have developed the recommendation that
    Total OSD node RAM required = (num OSDs) x (1 GB + osd_memory_target) + 16 GB
- In Rook/Kubernetes, Ceph OSDs will read the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT env vars to infer a new default value for 'osd_memory_target'
  - POD_MEMORY_REQUEST translates directly to 'osd_memory_target' 1:1
  - POD_MEMORY_LIMIT (if REQUEST is unset) will set 'osd_memory_target' using the formula ( LIMIT x osd_memory_target_cgroup_limit_ratio )
  - the resulting default 'osd_memory_target' will be min(REQUEST, LIMIT x ratio)
- Lars has suggested that setting limits is not a best practice for Ceph; when limits are hit, Ceph is likely already in a failure state, and killing daemons could result in a "thundering herd" distributed-systems problem

As you can see, there is a self-referential problem here. The OSD hardware recommendation should inform how we set k8s resource requests/limits for OSDs; however, doing so affects osd_memory_target, which alters the recommendation, which in turn alters our k8s resource settings, circularly forever.

We can address this today with a semi-workaround: set osd_memory_target explicitly in Ceph's config, and set a k8s resource request of (osd_memory_target + 1 GB + some extra) to meet the hardware recommendation. However, this means the Ceph feature of deriving osd_memory_target from resource requests isn't really used, because its behavior doesn't match actual best practice. And setting a realistic k8s resource request is still worthwhile, since Kubernetes then won't schedule more daemons onto a node than the node can realistically support.

Long-term, I wonder if it would be good to add into Ceph a computation that
[[ osd_memory_target = REQUEST - osd_memory_request_overhead ]]
where osd_memory_request_overhead defaults to 1 GB or somewhat higher. (There's a small back-of-the-envelope sketch of the math in the P.S. below.)

Please discuss, and let me know if anything here seems like I've gotten it wrong or if there are other options I haven't seen.

Cheers, and happy Tuesday!

Blaine
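P.S. For anyone who wants to poke at the numbers, here is a rough Python sketch of the sizing math as described in the notes above. It's illustrative only: the function names, the 0.8 ratio, and the 1 GB overhead default are my assumptions / the values from the notes, not anything that exists in Ceph today.

GiB = 1024 ** 3

def node_ram_recommendation(num_osds, osd_memory_target=4 * GiB):
    # SUSE recommendation from the notes above:
    # (num OSDs) x (1 GB + osd_memory_target) + 16 GB
    return num_osds * (1 * GiB + osd_memory_target) + 16 * GiB

def inferred_osd_memory_target(pod_memory_request=None, pod_memory_limit=None,
                               cgroup_limit_ratio=0.8, default=4 * GiB):
    # Current behavior as I understand it: REQUEST maps 1:1; otherwise
    # LIMIT x osd_memory_target_cgroup_limit_ratio; otherwise the 4 GB default.
    if pod_memory_request:
        return pod_memory_request
    if pod_memory_limit:
        return pod_memory_limit * cgroup_limit_ratio
    return default

def proposed_osd_memory_target(pod_memory_request, overhead=1 * GiB):
    # The long-term idea: subtract a per-OSD overhead from the request, so a
    # request sized per the hardware recommendation reproduces the intended
    # target instead of inflating it.
    return pod_memory_request - overhead

if __name__ == "__main__":
    num_osds = 12
    target = 4 * GiB
    per_osd_request = target + 1 * GiB  # per-OSD share of the recommendation
    print("node RAM for %d OSDs: %d GiB"
          % (num_osds, node_ram_recommendation(num_osds, target) // GiB))
    print("per-OSD k8s request: %d GiB" % (per_osd_request // GiB))
    # Today that request bumps the target to 5 GiB, which bumps the
    # recommendation, and so on (the circular problem described above).
    print("target inferred today: %d GiB"
          % (inferred_osd_memory_target(per_osd_request) // GiB))
    # With the proposed overhead subtraction the loop closes immediately.
    print("target with overhead subtraction: %d GiB"
          % (proposed_osd_memory_target(per_osd_request) // GiB))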