Re: Developing best-practices around Ceph daemons and kubernetes memory limits

On 2020-03-26T16:31:29, Blaine Gardner <BlGardner@xxxxxxxx> wrote:

Hi Blaine,

thanks for bringing this up.

> Advice I got from Joao: In the case of Ceph monitors, they are more
> likely to be experiencing memory over-use during recovery scenarios,
> and killing mons during this due to exceeding a limit may make the
> problem much worse. The best practice I have here is to only set
> memory requests for Ceph mons, ideally 4GB.
> 
> In the case of OSDs, things are a little more complex. OSDs will read
> the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT environment variables
> which are set by Rook inside Kubernetes pods, and the OSDs will tune
> their memory usage to meet this.

And that's great (though at that point, we probably ought to prevent
the manual setting of the memory target, since it's based on this
external setting?).
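
For reference, a minimal sketch of how this might be expressed in a
Rook CephCluster spec (field names per the Rook CRD as I understand
them; the values are purely illustrative, not recommendations). The
limit, if set, is what Rook hands to the OSD via the POD_MEMORY_LIMIT
environment variable mentioned above, presumably feeding into
osd_memory_target:

    spec:
      resources:
        mon:
          # Request only, no limit: don't kill mons during recovery.
          requests:
            memory: "4Gi"
        osd:
          requests:
            memory: "4Gi"
          # Exposed to the OSD as POD_MEMORY_LIMIT, which it uses to
          # tune its own memory consumption.
          limits:
            memory: "8Gi"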

> the risks of setting (or not setting) Pod Memory Limits on OSDs
> knowing that if the limit is set too low or if the OSDs begin to
> memory leak, they will be terminated and restarted by Kubernetes?
>  - One risk I can imagine is that if OSDs are all started at nearly
>    the same time and experience similar loads, they might be likely
>    to leak memory at similar rates and be killed by Kubernetes at
>    about the same time. Stampeding herds of OSD memory leaks followed
>    by memory limit terminations might occur, which could ripple out
>    and cause other OSDs to become unstable.
>  - Not setting a limit might mean that OSDs experience memory leaks
>    and cause OOM situations for other daemons or for the Kubernetes
>    kubelet if the system settings don't guarantee the kubelet some
>    amount of resources.

I think the risk of accidentally killing OSDs on a false-positive
threshold violation is too high; the impact can be that other pods
fail (we're providing storage to them, after all).

The same can be said for anything that's in the IO path.

Warning and alerting, yes. But unless the memory leaks are *really*
severe (and that's hard to quantify - 150%, 200% of the expected
maximum?), keeping the storage stack in service is probably still the
sensible choice. The ripple effect of killing OSDs is massive.
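
To make the warning/alerting concrete, here is a rough sketch of a
Prometheus rule (assuming the Prometheus Operator and kube-state-metrics
are deployed, and Rook's usual "rook-ceph-osd-*" pod naming; the 150%
threshold and 15m duration are illustrative, not recommendations):

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: ceph-osd-memory-warning
      namespace: rook-ceph
    spec:
      groups:
      - name: ceph-osd-memory
        rules:
        - alert: CephOsdMemoryHigh
          # Fire when an OSD container's working set stays above 150%
          # of its memory request for 15 minutes - warn, don't kill.
          expr: |
            max by (namespace, pod, container) (
              container_memory_working_set_bytes{namespace="rook-ceph", pod=~"rook-ceph-osd-.*", container="osd"}
            )
            > on (namespace, pod, container)
            (1.5 * max by (namespace, pod, container) (
              kube_pod_container_resource_requests{namespace="rook-ceph", resource="memory"}
            ))
          for: 15m
          labels:
            severity: warning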

The kernel OOM killer will still go after processes that run completely
amok if the total memory capacity of the system is exhausted.
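
And on the quoted point about the kubelet: protecting the node agents
is better done with node-level reservations than with per-daemon
limits. A sketch of the relevant kubelet settings (field names from
KubeletConfiguration v1beta1; the sizes are illustrative):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Carve memory/CPU out for the OS and for kubelet/container runtime,
    # so pod memory pressure cannot starve them.
    systemReserved:
      memory: "1Gi"
      cpu: "500m"
    kubeReserved:
      memory: "1Gi"
      cpu: "500m"
    # Let the kubelet evict pods before the kernel OOM killer has to act.
    evictionHard:
      memory.available: "500Mi"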

> Is it good to kill daemons if they exceed a limit in order to prevent
> memory leaks from affecting the rest of the system? MDS? RGW? MGR?
> NFS-Ganesha?

It might not matter as much with the mgr, but a misconfigured memory
limit repeatedly killing that one is likely painful too.

Killing an MDS/NFS instance might stop client systems from being able to
flush their dirty buffers, making the overall IO/memory situation worse.

> If anyone has knowledgeable recommendations about any daemons, I'd
> love your input. Please reply-all so that I get replies straight to my
> inbox.

Based on experience with HA stacks, I'd be very, very careful about
killing storage-path components. At the very least, those limits, be
they timeouts or resource caps, need to be extremely generous because
of the ripple effects.

Typically, the storage system is given minimum resource guarantees and
other workloads are limited so they cannot interfere - not the other
way around.
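
In Kubernetes terms, one way to express "guarantee the storage system,
cap everything else" is a high PriorityClass for the Ceph daemon pods
(assuming the Rook cluster can be pointed at it) plus default limits on
the application namespaces. Names and values below are hypothetical:

    # Assigned to the Ceph daemons so they are evicted only last.
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: storage-critical
    value: 1000000
    description: "Ceph daemons; evict only as a last resort."
    ---
    # Default ceilings for application workloads in their own namespace,
    # so an unbounded app cannot squeeze out the storage daemons.
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: app-defaults
      namespace: my-app        # hypothetical application namespace
    spec:
      limits:
      - type: Container
        defaultRequest:
          memory: "256Mi"
          cpu: "250m"
        default:
          memory: "1Gi"
          cpu: "1"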

Again, warning/alerting is sensible.


Regards,
    Lars

-- 
SUSE Software Solutions Germany GmbH, MD: Felix Imendörffer, HRB 36809 (AG Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli Zbinden)
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



