Hi Adam

No problem, I really appreciate your input :)

The memory stats returned are as follows:

    "memory_available_kb": 19858056,
    "memory_free_kb": 277480,
    "memory_total_kb": 32827840,

On Thu, Apr 4, 2024 at 10:14 PM Adam King <adking@xxxxxxxxxx> wrote:

> Sorry to keep asking for more info, but can I also get what `cephadm
> gather-facts` on that host returns for "memory_total_kb"? Might end up
> creating a unit test out of this case if we have a calculation bug here.
>
> On Thu, Apr 4, 2024 at 4:05 PM Mads Aasted <mads2a@xxxxxxxxx> wrote:
>
>> Sorry for the double send, I forgot to hit reply all so it would appear
>> on the page.
>>
>> Hi Adam
>>
>> If we multiply by 0.7 and work through the previous example from that
>> number, we would still arrive at roughly 2.5 GB for each OSD, yet the
>> host in question is trying to set it to less than 500 MB.
>> I have attached a list of the processes running on the host. You can
>> even see that the OSDs are currently taking up the most memory by far,
>> each using at least 5x the proposed minimum.
>>
>> root@my-ceph01:/# ceph orch ps | grep my-ceph01
>> crash.my-ceph01              my-ceph01               running (3w)  7m ago  13M  9052k      -  17.2.6
>> grafana.my-ceph01            my-ceph01  *:3000       running (3w)  7m ago  13M  95.6M      -  8.3.5
>> mds.testfs.my-ceph01.xjxfzd  my-ceph01               running (3w)  7m ago  10M   485M      -  17.2.6
>> mds.prodfs.my-ceph01.rplvac  my-ceph01               running (3w)  7m ago  12M  26.9M      -  17.2.6
>> mds.prodfs.my-ceph01.twikzd  my-ceph01               running (3w)  7m ago  12M  26.2M      -  17.2.6
>> mgr.my-ceph01.rxdefe         my-ceph01  *:8443,9283  running (3w)  7m ago  13M   907M      -  17.2.6
>> mon.my-ceph01                my-ceph01               running (3w)  7m ago  13M   503M  2048M  17.2.6
>> node-exporter.my-ceph01      my-ceph01  *:9100       running (3w)  7m ago  13M  20.4M      -  1.5.0
>> osd.3                        my-ceph01               running (3w)  7m ago  11M  2595M  4096M  17.2.6
>> osd.5                        my-ceph01               running (3w)  7m ago  11M  2494M  4096M  17.2.6
>> osd.6                        my-ceph01               running (3w)  7m ago  11M  2698M  4096M  17.2.6
>> osd.9                        my-ceph01               running (3w)  7m ago  11M  3364M  4096M  17.2.6
>> prometheus.my-ceph01         my-ceph01  *:9095       running (3w)  7m ago  13M   164M      -  2.42.0
>>
>> On Thu, Mar 28, 2024 at 2:13 AM Adam King <adking@xxxxxxxxxx> wrote:
>>
>>> I missed a step in the calculation. The total_memory_kb I mentioned
>>> earlier is also multiplied by the value of
>>> mgr/cephadm/autotune_memory_target_ratio before the subtractions for
>>> all the daemons are done. That value defaults to 0.7, which might
>>> explain why it seems to arrive at a lower value than expected. Beyond
>>> that, I'd need a list of the daemon types and counts on that host to
>>> try and work through what it's doing.
>>>
>>> On Wed, Mar 27, 2024 at 10:47 AM Mads Aasted <mads2a@xxxxxxxxx> wrote:
>>>
>>>> Hi Adam.
>>>>
>>>> Doing the calculations with what you are stating here, I arrive at a
>>>> total of roughly 13.3 GB for all the listed processes except the
>>>> OSDs, which would leave well in excess of 4 GB for each OSD.
>>>> Besides the mon daemon, which I can tell has a 2 GB limit on my host,
>>>> none of the other daemons seem to have a limit set according to ceph
>>>> orch ps. Then again, they are nowhere near the values stated in the
>>>> min_size_by_type list you quote.
>>>> Obviously I could disable the autotuning, but that would leave me
>>>> none the wiser as to why this exact host is trying to do this.
>>>>
>>>> On Tue, Mar 26, 2024 at 10:20 PM Adam King <adking@xxxxxxxxxx> wrote:
>>>>
>>>>> For context, the value the autotune goes with is taken from `cephadm
>>>>> gather-facts` on the host (the "memory_total_kb" field), and it then
>>>>> subtracts from that per daemon on the host according to
>>>>>
>>>>> min_size_by_type = {
>>>>>     'mds': 4096 * 1048576,
>>>>>     'mgr': 4096 * 1048576,
>>>>>     'mon': 1024 * 1048576,
>>>>>     'crash': 128 * 1048576,
>>>>>     'keepalived': 128 * 1048576,
>>>>>     'haproxy': 128 * 1048576,
>>>>>     'nvmeof': 4096 * 1048576,
>>>>> }
>>>>> default_size = 1024 * 1048576
>>>>>
>>>>> What's left is then divided by the number of OSDs on the host to
>>>>> arrive at the value. I'll also add, since it seems to be an issue on
>>>>> this particular host: if you add the "_no_autotune_memory" label to
>>>>> the host, it will stop trying to do this on that host.
>>>>>
>>>>> On Mon, Mar 25, 2024 at 6:32 PM <mads2a@xxxxxxxxx> wrote:
>>>>>
>>>>>> I have a virtual Ceph cluster running 17.2.6 with 4 Ubuntu 22.04
>>>>>> hosts in it, each with 4 OSDs attached. The first 2 servers, which
>>>>>> host the mgrs, have 32 GB of RAM each, and the remaining two have
>>>>>> 24 GB.
>>>>>> For some reason I am unable to identify, the first host in the
>>>>>> cluster appears to be constantly trying to set the osd_memory_target
>>>>>> variable to roughly half of the calculated minimum for the cluster,
>>>>>> and I see the following spamming the logs:
>>>>>>
>>>>>> Unable to set osd_memory_target on my-ceph01 to 480485376: error
>>>>>> parsing value: Value '480485376' is below minimum 939524096
>>>>>>
>>>>>> The default is set to 4294967296. I did double-check, and
>>>>>> osd_memory_base (805306368) + osd_memory_cache_min (134217728) adds
>>>>>> up to that minimum exactly.
>>>>>> osd_memory_target_autotune is currently enabled, but I cannot for
>>>>>> the life of me figure out how it is arriving at 480485376 as a value
>>>>>> for that particular host, which even has the most RAM. Neither the
>>>>>> cluster nor the host is anywhere near maximum memory utilization, so
>>>>>> it's not like there are processes competing for resources.
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
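
For reference, working Adam's description through with the figures in this
thread reproduces the logged value exactly. The following is only a sketch of
the calculation as described above, not the actual cephadm code; it assumes
the default 0.7 ratio, the min_size_by_type/default_size minimums quoted
earlier, and the daemon list shown in the `ceph orch ps` output for my-ceph01.

    MIB = 1048576

    memory_total_kb = 32827840   # "memory_total_kb" from `cephadm gather-facts`
    ratio = 0.7                  # mgr/cephadm/autotune_memory_target_ratio default

    min_size_by_type = {
        'mds': 4096 * MIB,
        'mgr': 4096 * MIB,
        'mon': 1024 * MIB,
        'crash': 128 * MIB,
        'keepalived': 128 * MIB,
        'haproxy': 128 * MIB,
        'nvmeof': 4096 * MIB,
    }
    default_size = 1024 * MIB

    # Non-OSD daemons on my-ceph01, taken from the `ceph orch ps` output above.
    non_osd_daemons = ['crash', 'grafana', 'mds', 'mds', 'mds', 'mgr', 'mon',
                       'node-exporter', 'prometheus']
    num_osds = 4                 # osd.3, osd.5, osd.6, osd.9

    budget = int(memory_total_kb * 1024 * ratio)
    for daemon_type in non_osd_daemons:
        budget -= min_size_by_type.get(daemon_type, default_size)

    print(budget // num_osds)    # prints 480485376, the value from the log

If the autotune really does apply the minimums this way, the three MDS daemons
and the mgr reserve 4 GiB each, which together with the other daemon minimums
consumes all but about 1.9 GB of the ~23.5 GB budget (0.7 of the host's total
memory) and leaves roughly 480 MB per OSD, matching the error message.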