On Thu, 7 Mar 2019, Sage Weil wrote:
> On Thu, 7 Mar 2019, Sebastien Han wrote:
> > Let me take back something, afterthoughts: yes, "memory.request" is
> > what we are interested in; the only thing is that there is no
> > /sys/fs/cgroup/memory equivalent.
> 
> Okay, if that's the case, then we have to take the limit from the
> kubernetes-specific environment variable. The downside is it's
> kube-specific, but so be it.
> 
> > If we take the OSD example, we could set REQUEST to 1GB and LIMIT to
> > 2GB (making up numbers again), which leaves 1GB for backfilling and
> > other stuff.
> 
> Just a side note here: there is no longer a recovery vs non-recovery
> memory utilization delta. IIRC we even backported that
> osd_pg_log_{min,max}_entries config change to luminous too.
> 
> > The last thing is: as part of these 4GB, how much do we want to give
> > to the osd_memory_target?
> 
> I think we should set osd_memory_target to POD_MEMORY_REQUEST if it's
> non-zero. If it is 0 and POD_MEMORY_LIMIT is set, I suggest .8 * that.

Hmm, I think the place to do that is here:

    https://github.com/ceph/ceph/blob/master/src/common/config.cc#L456

It's a bit awkward to parse the env variable into one of
{osd,mds,mon}_memory_target based on the daemon type, but it could be
done (once the others exist).

I wonder if, instead, we should have named the option just
memory_target. :/

sage

> 
> sage
> 
> > 
> > Thanks!
> > –––––––––
> > Sébastien Han
> > Principal Software Engineer, Storage Architect
> > 
> > "Always give 100%. Unless you're giving blood."
> > 
> > On Thu, Mar 7, 2019 at 9:51 AM Sebastien Han <shan@xxxxxxxxxx> wrote:
> > >
> > > Replying to everyone in a batch:
> > >
> > > * Yes, memory.limit is always >= memory.request. If memory.request
> > > is unset and memory.limit is set, then memory.request = memory.limit.
> > > k8s will refuse to schedule a POD if memory.limit < memory.request.
> > > The interesting part: if memory.limit is set (e.g. 1GB) and
> > > memory.request is unset, then during the scheduling process the
> > > scheduler will do memory.request = memory.limit.
> > > However, once the POD is running, the env variables will respectively
> > > be POD_MEMORY_LIMIT=1GB (in bytes) and POD_MEMORY_REQUEST=0... so
> > > that's confusing.
> > > Generally, we do care about both LIMIT and REQUEST, because REQUEST
> > > is only used for scheduling, so someone could do memory.request =
> > > 512MB and memory.limit = 1GB to make sure the POD can be scheduled.
> > > In the Ceph context this makes perfect sense since, for example, an
> > > OSD might need (making numbers up) 2GB for REQUEST (memory consumed
> > > 80% of the time) but 4GB for LIMIT ("burstable" when we backfill).
> > > So I believe we should read LIMIT only and apply our thresholds
> > > based on the value we read (unless it is 0).
> > >
> > > * The current Rook PR: https://github.com/rook/rook/pull/2764, which
> > > sets both osd_memory_target and mds_cache_memory_limit at startup on
> > > the daemon CLI.
> > > I consider this temporary until we get this done in Ceph natively
> > > (by reading env vars).
> > >
> > > * LIMIT is what should be considered the actual memory pool
> > > available to a daemon; auto-tuning should be done based on that
> > > value, and thresholds and alerts too.
> > >
> > > * The cgroup equivalent of POD_MEMORY_LIMIT is
> > > /sys/fs/cgroup/memory/memory.limit_in_bytes.
> > > If memory.limit_in_bytes is equal to 9223372036854771712, then there
> > > is no limit and we probably shouldn't tune anything or try to come
> > > up with a formula.
> > > There is also memory.usage_in_bytes that can be useful.
> > >
> > > * The route to uniformity with *_memory_limit is something that
> > > would help us in the short term to configure things from the
> > > orchestrator; in the long run, Ceph will tune that variable on its
> > > own.
> > >
> > > Thanks!
> > > –––––––––
> > > Sébastien Han
> > > Principal Software Engineer, Storage Architect
> > >
> > > "Always give 100%. Unless you're giving blood."
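Putting the pieces above together, a minimal sketch of what the lookup
could look like (illustrative only: get_pod_memory_target() and the
helper names are invented here, not Ceph code; the .8 factor and the
9223372036854771712 no-limit sentinel come from the messages above):

    // Illustrative sketch only -- helper names are invented, not Ceph code.
    #include <cstdint>
    #include <cstdlib>
    #include <fstream>

    // Value kubernetes leaves in memory.limit_in_bytes when there is no limit.
    static constexpr uint64_t CGROUP_NO_LIMIT = 9223372036854771712ULL;

    static uint64_t env_bytes(const char *name) {
      const char *v = std::getenv(name);
      return v ? std::strtoull(v, nullptr, 10) : 0;
    }

    static uint64_t cgroup_limit_bytes() {
      std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
      uint64_t v = 0;
      if (!(f >> v) || v >= CGROUP_NO_LIMIT)
        return 0;  // unreadable, or "no limit": don't tune anything
      return v;
    }

    // Returns the memory target to configure, or 0 if nothing was discovered.
    uint64_t get_pod_memory_target() {
      if (uint64_t request = env_bytes("POD_MEMORY_REQUEST"))
        return request;                      // aim for the scheduling request
      if (uint64_t limit = env_bytes("POD_MEMORY_LIMIT"))
        return limit * 8 / 10;               // Sage's suggested .8 * LIMIT
      return cgroup_limit_bytes() * 8 / 10;  // fall back to the cgroup value
    }

Whether the same .8 headroom should also apply to the raw cgroup value,
as the last line assumes, is an open question.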
> > > On Tue, Mar 5, 2019 at 7:06 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > > >
> > > > On 3/5/19 11:51 AM, Patrick Donnelly wrote:
> > > > > On Tue, Mar 5, 2019 at 8:41 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > >>> If memory.requests is omitted for a container, it defaults to limits.
> > > > >>> If memory.limits is not set, it defaults to 0 (unbounded).
> > > > >>> If neither of the two is specified, then we don't tune anything
> > > > >>> because we don't really know what to do.
> > > > >>>
> > > > >>> So far I've collected a couple of Ceph flags that are worth tuning:
> > > > >>>
> > > > >>> * mds_cache_memory_limit
> > > > >>> * osd_memory_target
> > > > >>>
> > > > >>> These flags will be passed at instantiation time for the MDS and
> > > > >>> the OSD daemon.
> > > > >>> Since most of the daemons have some cache flag, it'd be nice to
> > > > >>> unify them with a new option --{daemon}-memory-target.
> > > > >>> Currently I'm exposing POD properties via env vars too, which Ceph
> > > > >>> can consume later for more autotuning (POD_{MEMORY,CPU}_LIMIT,
> > > > >>> POD_{CPU,MEMORY}_REQUEST).
> > > > >> Ignoring mds_cache_memory_limit for now; I think we should wait
> > > > >> until we have mds_memory_target before doing any magic there.
> > > > >>
> > > > >> For the osd_memory_target, though, I think we could make the OSD
> > > > >> pick up on the POD_MEMORY_REQUEST variable and, if present, set
> > > > >> osd_memory_target to that value. Or, instead of putting the burden
> > > > >> on ceph, simply have rook pass --osd-memory-target on the command
> > > > >> line, or (post-startup) do
> > > > >> 'ceph daemon osd.N config set osd_memory_target ...'. (The
> > > > >> advantage of the latter is that it can more easily be overridden
> > > > >> at runtime.)
> > > > > Is POD_MEMORY_LIMIT|REQUEST standardized somewhere? Using an
> > > > > environment variable to communicate resource restrictions is useful
> > > > > but also hard to change on-the-fly. Can we (Ceph) read this
> > > > > information from the cgroup the Ceph daemon has been assigned to?
> > > > > Reducing the amount of configuration is one of our goals, so if we
> > > > > can make Ceph more aware of its environment as far as resource
> > > > > constraints go, we should go that route.
> > > > >
> > > > > The MDS should self-configure mds_cache_memory_limit based on
> > > > > memory.requests. That takes the magic formula out of the hands of
> > > > > users and forgetful devs :)
> > > > >
> > > > >> I'm not sure we have any specific action on the POD_MEMORY_LIMIT
> > > > >> value... the OSD should really be aiming for the REQUEST value
> > > > >> instead.
> > > > > I agree we should focus on memory.requests.
> > > >
> > > > I mentioned it in another doc, but I suspect it would be fairly easy
> > > > to adapt the osd_memory_target and autotuner code to work in the mds,
> > > > so long as we can adjust mds_cache_memory_limit on the fly. It would
> > > > be really nice to have all of the daemons conform to a standard
> > > > *_memory_limit interface. That's useful both inside a container
> > > > environment and on bare metal.
> > > >
> > > > Mark
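A sketch of the daemon-type dispatch described at the top of the thread
(again purely illustrative; per the discussion above, only
osd_memory_target exists today, so the mds/mon option names below are
hypothetical):

    // Illustrative sketch -- maps a daemon type to its memory-target option.
    #include <map>
    #include <string>

    std::string memory_target_option(const std::string &daemon_type) {
      static const std::map<std::string, std::string> options = {
        {"osd", "osd_memory_target"},
        {"mds", "mds_memory_target"},  // hypothetical: does not exist yet
        {"mon", "mon_memory_target"},  // hypothetical: does not exist yet
      };
      auto it = options.find(daemon_type);
      return it == options.end() ? std::string() : it->second;
    }

    // Hooked in near the config.cc line linked above, roughly (C++17):
    //   if (auto opt = memory_target_option(daemon_type); !opt.empty())
    //     if (uint64_t t = get_pod_memory_target())  // sketch from earlier
    //       set_val(opt, std::to_string(t));

Naming the option just memory_target, as suggested above, would make the
per-daemon map unnecessary.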
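For completeness, the two Rook-side alternatives mentioned in the quoted
exchange, with placeholder values:

    # at startup, on the daemon command line:
    --osd-memory-target=<bytes>

    # or post-startup, via the admin socket (easier to override at runtime):
    ceph daemon osd.N config set osd_memory_target <bytes>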