Anthony and all,

At this point I would like to try to turn off osd_memory_target_autotune,
reset my osd_memory_targets, and see how things work. However, I have not
figured out the incantation for removing (or even changing) the four
separate osd_memory_target settings that I seem to have.

# ceph config dump | grep osd_memory_target | cut -c 1-120
osd  host:ceph00  basic     osd_memory_target           9795377735
osd  host:ceph01  basic     osd_memory_target           10408081664
osd  host:ceph02  basic     osd_memory_target           11381160192
osd  host:ceph09  basic     osd_memory_target           22260320563
osd               advanced  osd_memory_target_autotune  true

In the dashboard edit panel for osd_memory_target, only the fourth value is
shown, so I guess managing this kind of multi-valued attribute via the
dashboard is not possible. Working with the CLI, I have not found a way to
get anything back except for the default value. I assume that there is a
document somewhere that explains the extended syntax for ceph config, but I
haven't found it.
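For what it's worth, my untested guess is something along the lines of the
commands below. The osd/host:<name> mask form, and whether "ceph config rm"
accepts it, are exactly the parts I have not been able to confirm, and the
6442450944 value is only a placeholder.

Turn autotuning off first, so the orchestrator doesn't just put the
host-masked values back:

# ceph config set osd osd_memory_target_autotune false

Remove the four host-masked entries:

# ceph config rm osd/host:ceph00 osd_memory_target
# ceph config rm osd/host:ceph01 osd_memory_target
# ceph config rm osd/host:ceph02 osd_memory_target
# ceph config rm osd/host:ceph09 osd_memory_target

Optionally set one explicit target for all OSDs (6442450944 = 6 GiB, purely
as an example), then check what an individual OSD actually resolves to:

# ceph config set osd osd_memory_target 6442450944
# ceph config get osd.0 osd_memory_target

If someone can confirm or correct that syntax, or point me at the right
document, I would appreciate it.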
Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx


On Wed, Oct 16, 2024 at 12:02 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

>
> Unfortunately, it's not quite that simple. At least until Mimic, but
> potentially later too, there was this behavior that either the OSD's
> allocator did not release or the kernel did not reclaim unused pages if
> there was sufficient total memory available. Which implied pointless
> swapping. The observation was exactly what Dave describes: huge resident
> memory size without any load. The resident memory size just stayed high
> for no apparent reason.
>
> I’ve seen that on non-Ceph systems too. Sometimes with Ceph I see
> tcmalloc not actually freeing unused mem; in those situations a “heap
> release” on the admin socket does wonders. I haven’t seen that since …
> Nautilus perhaps.
>
>
> The consequences were bad though, because during peering apparently the
> "leaked memory" started playing a role and OSDs crashed due to pages on
> swap not fitting into RAM.
>
> Back in the BSD days swap had to be >= physmem; these days we skate SysV
> style, where swap extends the VM space instead of backing it.
>
>
> Having said that, we do have large swap partitions on disk for emergency
> cases. We have swap off by default to avoid the memory "leak" issue and
> we actually have sufficient RAM to begin with - maybe that's a bit of a
> luxury.
>
> I can’t argue with that strategy, if your boot drives are large enough.
> I’ve as recently as this year suffered legacy systems with as little as
> 100GB boot drives — so overly balkanized that no partition was large
> enough.
>
> K8s as I understand it won’t even run if swap is enabled. Swap to me is
> what we did in 1988, when RAM cost money and we had 3MB (yes) diskless
> (yes) workstations. Out of necessity.
>
>
> The swap recommendation is a contentious one - I, for one, have always
> been against it.
>
> Same here. It’s a relic of the days when RAM was dramatically more
> expensive. I’ve had this argument with people stuck in the past, even
> when the resident performance expert 100% agreed with me.
>
> >IMHO, disabling swap is a recommendation that comes up because folks
> >are afraid of their OSDs becoming sluggish when their hosts become
> >oversubscribed.
>
> In part yes. I tell people all the time that Ceph is usually better off
> with a failed component than a crippled component.
>
> >But why not just avoid oversubscription altogether?
>
> Well, yeah. In the above case, with non-Ceph systems, there were like
> 2000 of them at unstaffed DCs around the world that were Dell R430s with
> only 64GB. There was a closely-guarded secret that deployments were
> blue-green, so enough vmem was needed to run two copies for brief
> intervals. Upgrading them would have been prohibitively expensive, even
> if they weren’t already like 8 years old. Plus certain people were
> stubborn.
>
> > If you set appropriate OSD memory targets, set kernel swappiness to
> > something like 10-20, and properly pin your OSDs in a system with >1
> > NUMA node so that they're evenly distributed across NUMA nodes, your
> > kernel will not swap because it simply has no reason to.
>
> I had swappiness arguments with the above people too, and had lobbied
> for the refresh nodes (again, non-Ceph) to be single-socket, to avoid
> the NUMA factor that demonstrably was degrading performance.
>
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
> > Sent: Wednesday, October 16, 2024 1:46 PM
> > To: Anthony D'Atri
> > Cc: Dave Hall; ceph-users
> > Subject: Re: Reef osd_memory_target and swapping
> >
> > On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> >
> >>
> >>> On Oct 15, 2024, at 1:06 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
> >>>
> >>> Hello.
> >>>
> >>> I'm seeing the following in the Dashboard -> Configuration panel
> >>> for osd_memory_target:
> >>>
> >>> Default:
> >>> 4294967296
> >>>
> >>> Current Values:
> >>> osd: 9797659437,
> >>> osd: 10408081664,
> >>> osd: 11381160192,
> >>> osd: 22260320563
> >>>
> >>> I have 4 hosts in the cluster right now - all OSD+MGR+MON. 3 have
> >>> 128GB RAM, the 4th has 256GB.
> >>
> >> https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory
> >>
> >> You have autotuning enabled, and it’s trying to use all of your
> >> physmem. I don’t know offhand how Ceph determines the amount of
> >> available memory, if it looks specifically for physmem or if it only
> >> looks at vmem. If it looks at vmem that arguably could be a bug.
> >>
> >>> On the host with 256GB, top shows some OSD
> >>> processes with very high VIRT and RES values - the highest VIRT OSD
> >>> has 13.0g. The highest RES is 8.5g.
> >>>
> >>> All 4 systems are currently swapping, but the 256GB system has much
> >>> higher swap usage.
> >>>
> >>> I am confused why I have 4 current values for osd_memory_target, and
> >>> especially about the 4th one at 22GB.
> >>>
> >>> Also, I'm recalling that there might be a recommendation to disable
> >>> swap, and I could easily do 'swapoff -a' when the swap usage is lower
> >>> than the free RAM.
> >>
> >> I tend to advise not using swap at all. Suggest disabling swap in
> >> fstab, then serially rebooting your OSD nodes, of course waiting for
> >> recovery between each before proceeding to the next.
> >>
> >>> Can anybody shed any light on this?
> >>>
> >>> Thanks.
> >>>
> >>> -Dave
> >>>
> >>> --
> >>> Dave Hall
> >>> Binghamton University
> >>> kdhall@xxxxxxxxxxxxxx
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >
> > The swap recommendation is a contentious one - I, for one, have always
> > been against it. IMHO, disabling swap is a recommendation that comes up
> > because folks are afraid of their OSDs becoming sluggish when their
> > hosts become oversubscribed.
> >
> > But why not just avoid oversubscription altogether?
> >
> > If you set appropriate OSD memory targets, set kernel swappiness to
> > something like 10-20, and properly pin your OSDs in a system with >1
> > NUMA node so that they're evenly distributed across NUMA nodes, your
> > kernel will not swap because it simply has no reason to.
> >
> > Because we leave swap enabled, we actually found that we were giving up
> > tons of performance -- after digging in when we saw swapping in some
> > cases previously, we found that the NUMA page balancer in the kernel
> > was shuffling pages around constantly before we had NUMA-pinned the OSD
> > processes. If we had just disabled swap, the OSDs would have still
> > become sluggish and identifying why would have been a lot harder,
> > because it's not that performance tanked... it just started dropping
> > off somewhat when pages started dancing between nodes.
> >
> > Ever since we NUMA-pinned our OSDs and set OSD memory targets
> > appropriately, not a byte has been swapped to disk in over a year
> > across a huge farm of OSDs (and they got noticeably faster, too).
> >
> > Tyler
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx