Hello Dave,

Try this:

  ceph config rm osd/host:ceph09 osd_memory_target

The document is here:
https://docs.ceph.com/en/reef/rados/configuration/ceph-conf/#sections-and-masks

On Wed, Oct 23, 2024 at 3:27 AM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>
> Anthony and all,
>
> At this point I would like to try to turn off osd_memory_target_autotune, reset my osd_memory_targets, and see how things work. However, I have not figured out the incantation for removing (or even changing) the four separate osd_memory_target settings that I seem to have.
>
> # ceph config dump | grep osd_memory_target | cut -c 1-120
> osd   host:ceph00   basic     osd_memory_target             9795377735
> osd   host:ceph01   basic     osd_memory_target            10408081664
> osd   host:ceph02   basic     osd_memory_target            11381160192
> osd   host:ceph09   basic     osd_memory_target            22260320563
> osd                 advanced  osd_memory_target_autotune   true
>
> In the dashboard edit panel for osd_memory_target, only the fourth value is shown, so I guess managing this kind of multi-valued attribute via the dashboard is not possible. Working with the CLI I have not found a way to get anything back except for the default value.
>
> I assume that there is a document somewhere that explains the extended syntax for ceph config, but I haven't found it.
>
> Thanks.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
> kdhall@xxxxxxxxxxxxxx
>
> On Wed, Oct 16, 2024 at 12:02 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

> > > Unfortunately, it's not quite that simple. At least up to Mimic, and potentially later too, there was this behavior where either the OSD's allocator did not release unused pages or the kernel did not reclaim them if there was sufficient total memory available, which implied pointless swapping. The observation was exactly what Dave describes: huge resident memory size without any load. The resident memory size just stayed high for no apparent reason.

> > I've seen that on non-Ceph systems too. Sometimes with Ceph I see tcmalloc not actually freeing unused memory; in those situations a "heap release" on the admin socket does wonders. I haven't seen that since ... Nautilus, perhaps.

> > > The consequences were bad though, because during peering apparently the "leaked memory" started playing a role and OSDs crashed due to pages on swap not fitting into RAM.

> > Back in the BSD days swap had to be >= physmem; these days we skate SysV style, where swap extends the VM space instead of backing it.

> > > Having said that, we do have large swap partitions on disk for emergency cases. We have swap off by default to avoid the memory "leak" issue, and we actually have sufficient RAM to begin with - maybe that's a bit of a luxury.

> > I can't argue with that strategy, if your boot drives are large enough. As recently as this year I've suffered legacy systems with boot drives as small as 100GB, so overly balkanized that no partition was large enough.

> > K8s as I understand it won't even run if swap is enabled. Swap to me is what we did in 1988, when RAM cost money and we had 3MB (yes) diskless (yes) workstations. Out of necessity.

> > > > The swap recommendation is a contentious one - I, for one, have always been against it.

> > Same here. It's a relic of the days when RAM was dramatically more expensive. I've had this argument with people stuck in the past, even when the resident performance expert 100% agreed with me.
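To make the fix at the top of the thread concrete: using the mask syntax from the linked sections-and-masks page, clearing the per-host overrides shown in Dave's dump and pinning a single target would look roughly like the sketch below. This is only a sketch; the host names are the ones from the dump, while the 8 GiB value and osd.0 are placeholder examples, not recommendations.

  # see which osd_memory_target overrides exist, and under which mask
  ceph config dump | grep osd_memory_target

  # stop the autotuner from re-applying per-host targets
  ceph config set osd osd_memory_target_autotune false

  # remove the per-host overrides (repeat for each host mask in the dump)
  ceph config rm osd/host:ceph00 osd_memory_target
  ceph config rm osd/host:ceph01 osd_memory_target
  ceph config rm osd/host:ceph02 osd_memory_target
  ceph config rm osd/host:ceph09 osd_memory_target

  # optionally set one cluster-wide target instead (8 GiB here is an example)
  ceph config set osd osd_memory_target 8589934592

  # verify what an individual daemon actually resolves to
  ceph config get osd.0 osd_memory_target

With osd_memory_target_autotune off, the orchestrator should stop rewriting those per-host values; the osd/host:... form used here is the one described in the sections-and-masks document Alexander linked.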
> > > > IMHO, disabling swap is a recommendation that comes up because folks are afraid of their OSDs becoming sluggish when their hosts become oversubscribed.

> > In part yes. I tell people all the time that Ceph is usually better off with a failed component than a crippled component.

> > > > But why not just avoid oversubscription altogether?

> > Well, yeah. In the above case, with non-Ceph systems, there were like 2000 of them at unstaffed DCs around the world that were Dell R430s with only 64GB. There was a closely-guarded secret that deployments were blue-green, so enough vmem was needed to run two copies for brief intervals. Upgrading them would have been prohibitively expensive, even if they weren't already like 8 years old. Plus certain people were stubborn.

> > > > If you set appropriate OSD memory targets, set kernel swappiness to something like 10-20, and properly pin your OSDs in a system with >1 NUMA node so that they're evenly distributed across NUMA nodes, your kernel will not swap because it simply has no reason to.

> > I had swappiness arguments with the above people too, and had lobbied for the refresh nodes (again, non-Ceph) to be single-socket to avoid the NUMA factor that demonstrably was degrading performance.

> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14

> > > ________________________________________
> > > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
> > > Sent: Wednesday, October 16, 2024 1:46 PM
> > > To: Anthony D'Atri
> > > Cc: Dave Hall; ceph-users
> > > Subject: Re: Reef osd_memory_target and swapping

> > > On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

> > > > On Oct 15, 2024, at 1:06 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:

> > > > > Hello.
> > > > >
> > > > > I'm seeing the following in the Dashboard -> Configuration panel for osd_memory_target:
> > > > >
> > > > > Default:
> > > > > 4294967296
> > > > >
> > > > > Current Values:
> > > > > osd: 9797659437,
> > > > > osd: 10408081664,
> > > > > osd: 11381160192,
> > > > > osd: 22260320563
> > > > >
> > > > > I have 4 hosts in the cluster right now - all OSD+MGR+MON. 3 have 128GB RAM, the 4th has 256GB.

> > > > https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory

> > > > You have autotuning enabled, and it's trying to use all of your physmem. I don't know offhand how Ceph determines the amount of available memory, whether it looks specifically at physmem or only at vmem. If it looks at vmem, that arguably could be a bug.

> > > > > On the host with 256GB, top shows some OSD processes with very high VIRT and RES values - the highest VIRT OSD has 13.0g. The highest RES is 8.5g.
> > > > >
> > > > > All 4 systems are currently swapping, but the 256GB system has much higher swap usage.
> > > > >
> > > > > I am confused why I have 4 current values for osd_memory_target, and especially about the 4th one at 22GB.
> > > > >
> > > > > Also, I'm recalling that there might be a recommendation to disable swap, and I could easily do 'swapoff -a' when the swap usage is lower than the free RAM.

> > > > I tend to advise not using swap at all.
> > > > Suggest disabling swap in fstab, then serially rebooting your OSD nodes, of course waiting for recovery between each before proceeding to the next.

> > > > > Can anybody shed any light on this?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > -Dave
> > > > >
> > > > > --
> > > > > Dave Hall
> > > > > Binghamton University
> > > > > kdhall@xxxxxxxxxxxxxx

> > > The swap recommendation is a contentious one - I, for one, have always been against it. IMHO, disabling swap is a recommendation that comes up because folks are afraid of their OSDs becoming sluggish when their hosts become oversubscribed.

> > > But why not just avoid oversubscription altogether?

> > > If you set appropriate OSD memory targets, set kernel swappiness to something like 10-20, and properly pin your OSDs in a system with >1 NUMA node so that they're evenly distributed across NUMA nodes, your kernel will not swap because it simply has no reason to.

> > > Because we leave swap enabled, we actually found that we were giving up tons of performance -- after digging in when we saw swapping in some cases previously, we found that the NUMA page balancer in the kernel was constantly shuffling pages around before we had NUMA-pinned the OSD processes. If we had just disabled swap, the OSDs would still have become sluggish and identifying why would have been a lot harder, because performance didn't tank outright... it just started dropping off somewhat when pages started dancing between NUMA nodes.

> > > Ever since we NUMA-pinned our OSDs and set OSD memory targets appropriately, not a byte has been swapped to disk in over a year across a huge farm of OSDs (and they got noticeably faster, too).

> > > Tyler

--
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
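For readers who want to act on the advice in this thread, the knobs discussed above map roughly to the commands below. This is only a sketch: it assumes a typical Linux host, osd.12 and NUMA node 0 are placeholder examples, and names like osd_numa_node and the ceph osd numa-status command should be verified against the documentation for your Ceph release.

  # where does the kernel stand today?
  swapon --show
  sysctl vm.swappiness

  # Tyler's approach: keep swap but make the kernel reluctant to use it
  sysctl -w vm.swappiness=10        # persist via a file in /etc/sysctl.d/ if kept

  # Anthony's approach: no swap at all
  swapoff -a                        # takes effect immediately, until next boot
  # then comment out the swap entries in /etc/fstab and reboot the OSD nodes
  # serially, waiting for recovery between each

  # NUMA placement: check how OSDs line up with NUMA nodes, and pin if needed
  ceph osd numa-status
  ceph config set osd.12 osd_numa_node 0    # osd.12 and node 0 are examples only

Note that swapoff -a pulls everything currently in swap back into RAM, so, as Dave observes above, it only makes sense while swap usage is lower than free memory.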