(Please use reply-all so that people I've explicitly tagged and I can continue to get direct email replies)

I have a conundrum around developing best practices for Rook/Ceph clusters: OSD memory targets versus hardware recommendations for OSDs. I want to lay down some bulleted notes first.

- 'osd_memory_target' defaults to 4 GB
- OSDs attempt to keep memory allocation to 'osd_memory_target'
  - BUT... this is only best-effort
  - AND... there is no guarantee that the kernel will actually reclaim memory that OSDs release/unmap
- Therefore, we (SUSE) have developed the recommendation that
    Total OSD node RAM required = (num OSDs) x (1 GB + osd_memory_target) + 16 GB
- In Rook/Kubernetes, Ceph OSDs will read the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT env vars to infer a new default value for 'osd_memory_target'
  - POD_MEMORY_REQUEST translates directly to 'osd_memory_target' 1:1
  - POD_MEMORY_LIMIT (if REQUEST is unset) will set 'osd_memory_target' using the formula ( LIMIT x osd_memory_target_cgroup_limit_ratio )
  - the resulting default 'osd_memory_target' will be min(REQUEST, LIMIT x ratio)
- Lars has suggested that setting limits is not a best practice for Ceph; when limits are hit, Ceph is likely already in a failure state, and killing daemons could result in a "thundering herd" distributed-systems problem

As you can see, there is a self-referential problem here. The OSD hardware recommendation should inform how we set k8s resource requests/limits for OSDs; however, doing so affects osd_memory_target, which alters the recommendation, which in turn alters our k8s resource settings, circularly forever.

We can address this today with a semi-workaround: set osd_memory_target explicitly in Ceph's config, and set a k8s resource request of (osd_memory_target + 1 GB + some extra) to meet the hardware recommendation. However, this means the Ceph feature of deriving osd_memory_target from resource requests isn't really used, because its behavior doesn't match actual best practice. And setting a realistic k8s resource request is still worthwhile, since Kubernetes then won't schedule more daemons onto a node than the node can realistically support.

Long-term, I wonder if it would be good to add into Ceph a computation that
[[ osd_memory_target = REQUEST - osd_memory_request_overhead ]]
where osd_memory_request_overhead defaults to 1 GB or somewhat higher. (There's a small back-of-the-envelope sketch of the math in the P.S. below.)

Please discuss, and let me know if anything here seems like I've gotten it wrong or if there are other options I haven't seen.

Cheers, and happy Tuesday!

Blaine
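P.S. For anyone who wants to poke at the numbers, here is a rough Python sketch of the sizing math as described in the notes above. It's illustrative only: the function names, the 0.8 ratio, and the 1 GB overhead default are my assumptions / the values from the notes, not anything that exists in Ceph today.

GiB = 1024 ** 3

def node_ram_recommendation(num_osds, osd_memory_target=4 * GiB):
    # SUSE recommendation from the notes above:
    # (num OSDs) x (1 GB + osd_memory_target) + 16 GB
    return num_osds * (1 * GiB + osd_memory_target) + 16 * GiB

def inferred_osd_memory_target(pod_memory_request=None, pod_memory_limit=None,
                               cgroup_limit_ratio=0.8, default=4 * GiB):
    # Current behavior as I understand it: REQUEST maps 1:1; otherwise
    # LIMIT x osd_memory_target_cgroup_limit_ratio; otherwise the 4 GB default.
    if pod_memory_request:
        return pod_memory_request
    if pod_memory_limit:
        return pod_memory_limit * cgroup_limit_ratio
    return default

def proposed_osd_memory_target(pod_memory_request, overhead=1 * GiB):
    # The long-term idea: subtract a per-OSD overhead from the request, so a
    # request sized per the hardware recommendation reproduces the intended
    # target instead of inflating it.
    return pod_memory_request - overhead

if __name__ == "__main__":
    num_osds = 12
    target = 4 * GiB
    per_osd_request = target + 1 * GiB  # per-OSD share of the recommendation
    print("node RAM for %d OSDs: %d GiB"
          % (num_osds, node_ram_recommendation(num_osds, target) // GiB))
    print("per-OSD k8s request: %d GiB" % (per_osd_request // GiB))
    # Today that request bumps the target to 5 GiB, which bumps the
    # recommendation, and so on (the circular problem described above).
    print("target inferred today: %d GiB"
          % (inferred_osd_memory_target(per_osd_request) // GiB))
    # With the proposed overhead subtraction the loop closes immediately.
    print("target with overhead subtraction: %d GiB"
          % (proposed_osd_memory_target(per_osd_request) // GiB))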