Hi Dan and Mark,

could you please let me know if you can read the files with the version info I provided in my previous e-mail? I'm in the process of collecting data with more FS activity and would like to send it in a format that is useful for investigation.

Right now I'm observing a daily swap growth of ca. 100-200MB on servers with 16 OSDs each (1 SSD and 15 HDDs). The OS and daemons operate fine, and the OS manages to keep enough RAM available. The mempool dump also still shows onode and data cached at a seemingly reasonable level.

Users report more stable FS performance since I increased the cache min sizes on all OSDs.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 17 August 2020 09:37
To: Dan van der Ster
Cc: ceph-users
Subject: Re: OSD memory leak?

Hi Dan,

I use the container docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7; it's a CentOS 7 build. The version is:

# ceph -v
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)

On CentOS, the profiler packages are named differently, without the "google-" prefix. The version I have installed is:

# pprof --version
pprof (part of gperftools 2.0)

Copyright 1998-2007 Google Inc.

This is BSD licensed software; see the source for copying conditions
and license information.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

It is possible to install pprof inside this container and analyse the *.heap files I provided.

If this doesn't work for you and you want me to generate the text output for the heap files, I can do that. Please let me know if I should do all files and with what option (e.g. against a base etc.).
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
Sent: 14 August 2020 10:38:57
To: Frank Schilder
Cc: Mark Nelson; ceph-users
Subject: Re: Re: OSD memory leak?

Hi Frank,

I'm having trouble getting the exact version of ceph you used to create this heap profile. Could you run the google-pprof --text steps at [1] and share the output?

Thanks, Dan

[1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/

On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Mark,
>
> here is a first collection of heap profiling data (valid 30 days):
>
> https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
>
> This was collected with the following config settings:
>
> osd dev osd_memory_cache_min 805306368
> osd basic osd_memory_target 2147483648
>
> Setting the cache_min value seems to help keep cache space available. Unfortunately, the above collection is for 12 days only. I needed to restart the OSD and will need to restart it soon again. I hope I can then run a longer sample. The profiling does cause slow ops though.
>
> Maybe you can see something already? It seems to have collected some leaked memory. Unfortunately, it was a period of extremely low load. Basically, with the day of recording the utilization dropped to almost zero.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 21 July 2020 12:57:32
> To: Mark Nelson; Dan van der Ster
> Cc: ceph-users
> Subject: Re: OSD memory leak?
>
> Quick question: Is there a way to change the frequency of heap dumps? On this page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function HeapProfilerSetAllocationInterval() is mentioned, but no other way of configuring this.
> Is there a config parameter or a ceph daemon call to adjust this?
>
> If not, can I change the dump path?
>
> It's likely to overrun my log partition quickly if I cannot adjust either of the two.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 20 July 2020 15:19:05
> To: Mark Nelson; Dan van der Ster
> Cc: ceph-users
> Subject: Re: OSD memory leak?
>
> Dear Mark,
>
> thank you very much for the very helpful answers. I will raise osd_memory_cache_min, leave everything else alone and watch what happens. I will report back here.
>
> Thanks also for raising this as an issue.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Mark Nelson <mnelson@xxxxxxxxxx>
> Sent: 20 July 2020 15:08:11
> To: Frank Schilder; Dan van der Ster
> Cc: ceph-users
> Subject: Re: Re: OSD memory leak?
>
> On 7/20/20 3:23 AM, Frank Schilder wrote:
> > Dear Mark and Dan,
> >
> > I'm in the process of restarting all OSDs and could use some quick advice on bluestore cache settings. My plan is to set higher minimum values and deal with accumulated excess usage via regular restarts. Looking at the documentation (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/), I find the following relevant options (with defaults):
> >
> > # Automatic Cache Sizing
> > osd_memory_target {4294967296} # 4GB
> > osd_memory_base {805306368} # 768MB
> > osd_memory_cache_min {134217728} # 128MB
> >
> > # Manual Cache Sizing
> > bluestore_cache_meta_ratio {.4} # 40% ?
> > bluestore_cache_kv_ratio {.4} # 40% ?
> > bluestore_cache_kv_max {512 * 1024*1024} # 512MB
> >
> > Q1) If I increase osd_memory_cache_min, should I also increase osd_memory_base by the same or some other amount?
>
>
> osd_memory_base is a hint at how much memory the OSD could consume
> outside the cache once it's reached steady state. It basically sets a
> hard cap on how much memory the cache will use to avoid over-committing
> memory and thrashing when we exceed the memory limit. It's not necessary
> to get it right, it just helps smooth things out by making the automatic
> memory tuning less aggressive. IE if you have a 2 GB memory target and
> a 512MB base, you'll never assign more than 1.5GB to the cache on the
> assumption that the rest of the OSD will eventually need 512MB to
> operate even if it's not using that much right now. I think you can
> probably just leave it alone. What you and Dan appear to be seeing is
> that this number isn't static in your case but increases over time
> anyway. Eventually I'm hoping that we can automatically account for more
> and more of that memory by reading the data from the mempools.
>
> > Q2) The cache ratio options are shown under the section "Manual Cache Sizing". Do they also apply when cache auto tuning is enabled? If so, is it worth changing these defaults for higher values of osd_memory_cache_min?
>
>
> They actually do have an effect on the automatic cache sizing and
> probably shouldn't only be under the manual section. When you have the
> automatic cache sizing enabled, those options will affect the "fair
> share" values of the different caches at each cache priority level. IE
> at priority level 0, if both caches want more memory than is available,
> those ratios will determine how much each cache gets. If there is more
> memory available than requested, each cache gets as much as they want
> and we move on to the next priority level and do the same thing again.
> So in this case the ratios end up being sort of more like fallback
> settings for when you don't have enough memory to fulfill all cache
> requests at a given priority level, but otherwise are not utilized until
> we hit that limit.
> The goal with this scheme is to make sure that "high
> priority" items in each cache get first dibs at the memory even if it
> might skew the ratios. This might be things like rocksdb bloom filters
> and indexes, or potentially very recent hot items in one cache vs very
> old items in another cache. The ratios become more like guidelines than
> hard limits.
>
>
> When you change to manual mode, you set an overall bluestore cache size
> and each cache gets a flat percentage of it based on the ratios. With
> 0.4/0.4 you will always have 40% for onode, 40% for omap, and 20% for
> data even if one of those caches does not use all of its memory.
>
>
> >
> > Many thanks for your help with this. I can't find answers to these questions in the docs.
> >
> > There might be two reasons for high osd_map memory usage. One is that our OSDs seem to hold a large number of OSD maps:
>
>
> I brought this up in our core team standup last week. Not sure if
> anyone has had time to look at it yet though.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
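
The cache sizing rules Mark describes in the thread can be sketched roughly as follows. This is only an illustration in Python, not Ceph's actual autotuning code: the function names (auto_cache_cap, manual_split, fair_share), the cache labels ("meta" for onode, "kv" for omap, "data") and the way a shortage is split by ratio are all assumptions made for the example.

```python
# Illustrative sketch only: NOT Ceph's actual autotuning code.
# All function and cache names below are hypothetical.

MIB = 1024 * 1024


def auto_cache_cap(memory_target, memory_base):
    """Automatic mode: the cache never gets more than target minus base."""
    return max(memory_target - memory_base, 0)


def manual_split(cache_size, meta_ratio=0.4, kv_ratio=0.4):
    """Manual mode: flat percentages; data receives whatever remains."""
    meta = round(cache_size * meta_ratio)
    kv = round(cache_size * kv_ratio)
    return {"meta": meta, "kv": kv, "data": cache_size - meta - kv}


def fair_share(available, requests_by_priority, ratios):
    """Automatic mode: satisfy requests one priority level at a time.

    While a level's total request fits, every cache gets what it asks for.
    At the first level that does not fit, the remaining memory is split by
    the ratios (an assumed simplification) and allocation stops.
    """
    granted = {name: 0 for name in ratios}
    for requests in requests_by_priority:
        wanted = sum(requests.values())
        if wanted <= available:
            for name, amount in requests.items():
                granted[name] += amount
            available -= wanted
        else:
            for name, amount in requests.items():
                granted[name] += min(amount, round(available * ratios[name]))
            break
    return granted


if __name__ == "__main__":
    # Mark's example: 2 GB target and 512 MB base leave at most 1.5 GB of cache.
    print(auto_cache_cap(2048 * MIB, 512 * MIB) // MIB, "MiB cache cap")
    # Manual mode with 0.4/0.4: 40% onode, 40% omap, 20% data.
    print(manual_split(1000 * MIB))
    # Priority level 0 fits entirely; level 1 falls back to the ratios.
    ratios = {"meta": 0.4, "kv": 0.4, "data": 0.2}
    levels = [{"meta": 30, "kv": 20, "data": 10},
              {"meta": 50, "kv": 50, "data": 50}]
    print(fair_share(100, levels, ratios))
```

In the last call, level 0 is granted in full because its requests fit, so the ratios only come into play at level 1. This matches the point that the ratios act as fallback guidelines in automatic mode, while in manual mode they are fixed shares.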