Re: Reef osd_memory_target and swapping

Hi Dave,

After removing the per-host osd_memory_target settings with the command Alex just shared, I would advise you to disable swap and reboot these OSD nodes (the reboot step is important). In the past we've had issues with swap interfering badly with the OSD memory calculation, ending up with OSDs swapping and nodes becoming unresponsive. After disabling swap, OSD memory consumption would finally stay within what we allowed them to use (osd_memory_target).
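On a typical systemd/fstab-based node that boils down to something like the following (the exact fstab entries and swap devices will differ per host):

swapoff -a       # stop using swap immediately
vi /etc/fstab    # comment out the swap entry so it stays off after the reboot
reboot           # once the cluster is healthy enough to take the node down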

Also be aware that per-host settings (host masks) were not always effective in the past due to this bug [1], which left the memory autotuner useless: whatever it set for osd_memory_target on a per-host basis, the OSDs would ignore it and start with the default osd_memory_target value of 4 GB. We used to set osd_memory_target at the rack level to work around this bug.
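For illustration (the rack name and value here are only placeholders), that rack-level workaround uses the same mask syntax:

ceph config set osd/rack:rack1 osd_memory_target 8589934592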

Check the osd_memory_target value of running OSDs with 'ceph config show osd.x osd_memory_target' to make sure that what you set is what actually applies.
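If in doubt, the OSD's admin socket (run on the node hosting the daemon; osd.0 is just an example) reports the value the running process is actually using:

ceph daemon osd.0 config get osd_memory_target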

Regards,
Frédéric.

[1] https://tracker.ceph.com/issues/48750
________________________________
From: Dave Hall <kdhall@xxxxxxxxxxxxxx>
Sent: Tuesday, October 22, 2024 9:28 PM
To: Anthony D'Atri
Cc: Tyler Stachecki; ceph-users
Subject: Re: Reef osd_memory_target and swapping

Anthony and all,

At this point I would like to try to turn off osd_memory_target_autotune, 
reset my osd_memory_targets, and see how things work.  However, I have not 
figured out the incantation for removing (or even changing) the four 
separate osd_memory_target settings that I seem to have. 

# ceph config dump | grep osd_memory_target | cut -c 1-120 
osd    host:ceph00   basic     osd_memory_target             9795377735
osd    host:ceph01   basic     osd_memory_target             10408081664
osd    host:ceph02   basic     osd_memory_target             11381160192
osd    host:ceph09   basic     osd_memory_target             22260320563
osd                  advanced  osd_memory_target_autotune    true


In the dashboard edit panel for osd_memory_target, only the fourth value is 
shown, so I guess managing this kind of multi-valued attribute via the 
dashboard is not possible.  Working with the CLI I have not found a way to 
get anything back except for the default value. 

I assume that there is a document somewhere that would explain the extended 
syntax for ceph config, but I haven't found it. 

Thanks. 

-Dave 

-- 
Dave Hall 
Binghamton University 
kdhall@xxxxxxxxxxxxxx 

On Wed, Oct 16, 2024 at 12:02 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote: 

> 
> > Unfortunately, it's not quite that simple. At least until Mimic, and
> potentially later too, there was this behavior where either the OSD's
> allocator did not release unused pages or the kernel did not reclaim them if
> there was sufficient total memory available, which implied pointless
> swapping. The observation was exactly what Dave describes: huge resident
> memory size without any load. The resident memory size just stayed high for
> no apparent reason.
> 
> I’ve seen that on non-Ceph systems too.  Sometimes with Ceph I see 
> tcmalloc not actually freeing unused mem; in those situations a “heap 
> release” on the admin socket does wonders.  I haven’t seen that since … 
> Nautilus perhaps. 
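For reference, that admin-socket heap release takes the form (the OSD id is a placeholder):

ceph daemon osd.<id> heap release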
> 
> > The consequences were bad though, because during peering apparently the 
> "leaked memory" started playing a role and OSDs crashed due to pages on 
> swap not fitting into RAM. 
> 
> Back in the BSD days swap had to be >= physmem; these days we skate SysV
> style where swap extends the VM space instead of backing it. 
> 
> > Having said that, we do have large swap partitions on disk for emergency 
> cases. We have swap off by default to avoid the memory "leak" issue and we 
> actually have sufficient RAM to begin with - maybe that's a bit of a luxury. 
> 
> I can’t argue with that strategy, if your boot drives are large enough. 
> I’ve as recently as this year suffered legacy systems with as little as 
> 100GB boot drives — so overly balkanized that no partition was large enough. 
> 
> K8s as I understand it won’t even run if swap is enabled.  Swap to me is 
> what we did in 1988 when RAM cost money and we had 3MB (yes) diskless (yes) 
> workstations.  Out of necessity. 
> 
> > The swap recommendation is a contentious one - I, for one, have always 
> been against it. 
> 
> Same here.  It’s a relic of the days when RAM was dramatically more 
> expensive.  I’ve had this argument with people stuck in the past, even when 
> the resident performance expert 100% agreed with me. 
> 
> IMHO, disabling swap is a recommendation that comes up because folks are
> afraid of their OSDs becoming sluggish when their hosts become
> oversubscribed.
> 
> In part yes.  I tell people all the time that Ceph is usually better off 
> with a failed component than a crippled component. 
> 
> But why not just avoid oversubscription altogether?
> 
> Well, yeah.  In the above case, with non-Ceph systems, there were like 
> 2000 of them at unstaffed DCs around the world that were Dell R430s with
> only 64GB.  There was a closely-guarded secret that deployments were 
> blue-green so enough vmem was needed to run two copies for brief 
> intervals.  Upgrading them would have been prohibitively expensive, even if 
> they weren’t already like 8 years old.  Plus certain people were stubborn. 
> 
> 
> > If you set appropriate OSD memory targets, set kernel swappiness to
> > something like 10-20, and properly pin your OSDs in a system with >1 NUMA 
> > node so that they're evenly distributed across NUMA nodes, your kernel 
> will 
> > not swap because it simply has no reason to. 
> 
> I had swappiness arguments with the above people too, and had lobbied for
> the refresh nodes (again, non-Ceph) to be single-socket to avoid the NUMA 
> factor that demonstrably was degrading performance. 
> 
> 
> 
> 
> 
> > 
> > Best regards, 
> > ================= 
> > Frank Schilder 
> > AIT Risø Campus 
> > Bygning 109, rum S14 
> > 
> > ________________________________________ 
> > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx> 
> > Sent: Wednesday, October 16, 2024 1:46 PM 
> > To: Anthony D'Atri 
> > Cc: Dave Hall; ceph-users 
> > Subject:  Re: Reef osd_memory_target and swapping 
> > 
> > On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote: 
> > 
> >> 
> >> 
> >>> On Oct 15, 2024, at 1:06 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote: 
> >>> 
> >>> Hello. 
> >>> 
> >>> I'm seeing the following in the Dashboard  -> Configuration panel 
> >>> for osd_memory_target: 
> >>> 
> >>> Default: 
> >>> 4294967296 
> >>> 
> >>> Current Values: 
> >>> osd: 9797659437, 
> >>> osd: 10408081664, 
> >>> osd: 11381160192, 
> >>> osd: 22260320563 
> >>> 
> >>> I have 4 hosts in the cluster right now - all OSD+MGR+MON.  3 have
> 128GB 
> >>> RAM, the 4th has 256GB. 
> >> 
> >> 
> >> 
> https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory 
> >> 
> >> You have autotuning enabled, and it’s trying to use all of your physmem. 
> >> I don’t know offhand how Ceph determines the amount of available 
> memory, if 
> >> it looks specifically for physmem or if it only looks at vmem.  If it 
> looks 
> >> at vmem, that arguably could be a bug.
> >> 
> >> 
> >>> On the host with 256GB, top shows some OSD 
> >>> processes with very high VIRT and RES values - the highest VIRT OSD has 
> >>> 13.0g.  The highest RES is 8.5g. 
> >>> 
> >>> All 4 systems are currently swapping, but the 256GB system has much 
> >> higher 
> >>> swap usage. 
> >>> 
> >>> I am confused why I have 4 current values for osd_memory_target, and 
> >>> especially about the 4th one at 22GB. 
> >>> 
> >>> Also, I'm recalling that there might be a recommendation to disable 
> swap,
> >>> and I could easily do 'swapoff -a' when the swap usage is lower than 
> the 
> >>> free RAM. 
> >> 
> >> I tend to advise not using swap at all.  Suggest disabling swap in 
> fstab, 
> >> then serially rebooting your OSD nodes, of course waiting for recovery 
> >> between each before proceeding to the next. 
> >> 
> >>> 
> >>> Can anybody shed any light on this? 
> >>> 
> >>> Thanks. 
> >>> 
> >>> -Dave 
> >>> 
> >>> -- 
> >>> Dave Hall 
> >>> Binghamton University 
> >>> kdhall@xxxxxxxxxxxxxx 
> > 
> > 
> > The swap recommendation is a contentious one - I, for one, have always 
> been 
> > against it. IMHO, disabling swap is a recommendation that comes up 
> because 
> > folks are afraid of their OSDs becoming sluggish when their hosts become 
> > oversubscribed. 
> > 
> > But why not just avoid oversubscription altogether? 
> > 
> > If you set appropriate OSD memory targets, set kernel swappiness to
> > something like 10-20, and properly pin your OSDs in a system with >1 NUMA 
> > node so that they're evenly distributed across NUMA nodes, your kernel 
> will 
> > not swap because it simply has no reason to. 
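One way to express those knobs with standard tooling and Ceph's own options (the target value and NUMA node below are placeholders; the right numbers depend on the host):

sysctl -w vm.swappiness=10                        # persist it under /etc/sysctl.d/ as well
ceph config set osd osd_memory_target 8589934592  # explicit per-OSD target, 8 GiB here
ceph config set osd.12 osd_numa_node 0            # pin osd.12 to NUMA node 0 (osd_numa_auto_affinity can also handle this)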
> > 
> > Because we leave swap enabled, we actually found that we were giving up 
> > tons of performance -- after digging in when we saw swapping in some 
> cases 
> > previously, we found that the NUMA page balancer in the kernel was 
> > shuffling pages around constantly before we had NUMA pinned the OSD 
> > processes. If we had just disabled swap, the OSDs would have still become 
> > sluggish and identifying why would have been a lot harder, because
> > performance didn't outright tank... it just started dropping off somewhat
> > when pages started dancing between nodes.
> > 
> > Ever since we NUMA pinned our OSDs and set OSD memory targets 
> > appropriately, not a byte has been swapped to disk in over a year across 
> a 
> > huge farm of OSDs (and they got noticeably faster, too).
> > 
> > Tyler 
> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



