Re: Reef osd_memory_target and swapping

"Anthony D'Atri" <aad@xxxxxxxxxxxxxx> · Wed, 16 Oct 2024 12:02:06 -0400

> Unfortunately, it's not quite that simple. At least until Mimic, and potentially later too, there was this behavior where either the OSD's allocator did not release unused pages or the kernel did not reclaim them if there was sufficient total memory available. That led to pointless swapping. The observation was exactly what Dave describes: huge resident memory size without any load. The resident memory size just stayed high for no apparent reason.

I’ve seen that on non-Ceph systems too.  Sometimes with Ceph I see tcmalloc not actually freeing unused mem; in those situations a “heap release” on the admin socket does wonders.  I haven’t seen that since … Nautilus perhaps.
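
For anyone who hasn't used it, here is a minimal sketch of scripting that heap release. It uses the `ceph tell osd.N heap release` form rather than talking to the admin socket directly, and it assumes a working ceph CLI with an admin keyring on the host. The function names are made up for illustration.

```python
import subprocess


def heap_release_command(osd_id: int) -> list[str]:
    """Build the argv that asks one OSD's tcmalloc to hand freed memory back to the OS."""
    if osd_id < 0:
        raise ValueError("OSD id must be non-negative")
    return ["ceph", "tell", f"osd.{osd_id}", "heap", "release"]


def heap_release(osd_id: int) -> str:
    """Run the command and return whatever the CLI printed.

    Assumption: the ceph CLI is installed and can reach the cluster from here.
    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        heap_release_command(osd_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout + result.stderr
```

Calling `heap_release(12)` would run the command against osd.12. Comparing RES in top before and after shows whether the allocator was actually holding on to freed memory.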

> The consequences were bad though, because during peering apparently the "leaked memory" started playing a role and OSDs crashed due to pages on swap not fitting into RAM.

Back in the BSD days swap had to be >= physmem; these days we skate SysV style, where swap extends the VM space instead of backing it.

> Having said that, we do have large swap partitions on disk for emergency cases. We have swap off by default to avoid the memory "leak" issue and we actually have sufficient RAM to begin with - maybe that's a bit of a luxury.

I can’t argue with that strategy, if your boot drives are large enough.  As recently as this year I’ve suffered legacy systems with boot drives as small as 100GB, so overly balkanized that no partition was large enough.

K8s as I understand it won’t even run if swap is enabled.  Swap to me is what we did in 1988 when RAM cost money and we had 3MB (yes) diskless (yes) workstations.  Out of necessity.

> The swap recommendation is a contentious one - I, for one, have always been against it.

Same here.  It’s a relic of the days when RAM was dramatically more expensive.  I’ve had this argument with people stuck in the past, even when the resident performance expert 100% agreed with me.  

>IMHO, disabling swap is a recommendation that comes up because folks are afraid of their OSDs becoming sluggish when their hosts become
>oversubscribed.

In part yes.  I tell people all the time that Ceph is usually better off with a failed component than a crippled component.

>But why not just avoid oversubscription altogether?

Well, yeah.  In the above case, with non-Ceph systems, there were like 2000 of them at unstaffed DCs around the world that were Dell R430s with only 64GB.  There was a closely-guarded secret that deployments were blue-green, so enough vmem was needed to run two copies for brief intervals.  Upgrading them would have been prohibitively expensive, even if they weren’t already like 8 years old.  Plus certain people were stubborn.

> If you set appropriate OSD memory targets, set kernel swappiness to
> something like 10-20, and properly pin your OSDs in a system with >1 NUMA
> node so that they're evenly distributed across NUMA nodes, your kernel will
> not swap because it simply has no reason to.

I had swappiness arguments with the above people too, and had lobbied for the refresh nodes (again, non-Ceph) to be single-socket to avoid the NUMA factor that demonstrably was degrading performance.
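
Since swappiness and swap usage keep coming up in this thread, here is a minimal sketch for checking both on a Linux host. It reads /proc/meminfo and /proc/sys/vm/swappiness. The helper names are hypothetical, and the "swap in use must fit in MemAvailable" test is only my rough reading of Dave's "swap usage lower than free RAM" rule of thumb, not an official guideline.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(meminfo: dict[str, int]) -> int:
    """Swap currently in use, in kB."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def swapoff_looks_safe(meminfo: dict[str, int]) -> bool:
    """Assumed rule of thumb: everything in swap must fit into available RAM."""
    return swap_used_kb(meminfo) < meminfo.get("MemAvailable", 0)


def report() -> None:
    """Print swappiness and swap usage for the local host (Linux only)."""
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    swappiness = int(Path("/proc/sys/vm/swappiness").read_text())
    print(f"vm.swappiness = {swappiness}")
    print(f"swap in use   = {swap_used_kb(info)} kB")
    print(f"swapoff -a looks safe: {swapoff_looks_safe(info)}")
```

Running `report()` on each OSD node before a `swapoff -a` gives a quick sanity check. It says nothing about NUMA placement, which needs separate tooling.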

> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
> Sent: Wednesday, October 16, 2024 1:46 PM
> To: Anthony D'Atri
> Cc: Dave Hall; ceph-users
> Subject:  Re: Reef osd_memory_target and swapping
> 
> On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> 
>> 
>> 
>>> On Oct 15, 2024, at 1:06 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>>> 
>>> Hello.
>>> 
>>> I'm seeing the following in the Dashboard  -> Configuration panel
>>> for osd_memory_target:
>>> 
>>> Default:
>>> 4294967296
>>> 
>>> Current Values:
>>> osd: 9797659437,
>>> osd: 10408081664,
>>> osd: 11381160192,
>>> osd: 22260320563
>>> 
>>> I have 4 hosts in the cluster right now - all OSD+MGR+MON.  3 have 128GB
>>> RAM, the 4th has 256GB.
>> 
>> 
>> https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory
>> 
>> You have autotuning enabled, and it’s trying to use all of your physmem.
>> I don’t know offhand how Ceph determines the amount of available memory, if
>> it looks specifically for physmem or if it only looks at vmem.  If it looks
>> at vmem that arguably could be a bug.
>> 
>> 
>>> On the host with 256GB, top shows some OSD
>>> processes with very high VIRT and RES values - the highest VIRT OSD has
>>> 13.0g.  The highest RES is 8.5g.
>>> 
>>> All 4 systems are currently swapping, but the 256GB system has much higher
>>> swap usage.
>>> 
>>> I am confused why I have 4 current values for osd_memory_target, and
>>> especially about the 4th one at 22GB.
>>> 
>>> Also, I'm recalling that there might be a recommendation to disable swap,
>>> and I could easily do 'swapoff -a' when the swap usage is lower than the
>>> free RAM.
>> 
>> I tend to advise not using swap at all.  Suggest disabling swap in fstab,
>> then serially rebooting your OSD nodes, of course waiting for recovery
>> between each before proceeding to the next.
>> 
>>> 
>>> Can anybody shed any light on this?
>>> 
>>> Thanks.
>>> 
>>> -Dave
>>> 
>>> --
>>> Dave Hall
>>> Binghamton University
>>> kdhall@xxxxxxxxxxxxxx
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> 
> 
> 
> The swap recommendation is a contentious one - I, for one, have always been
> against it. IMHO, disabling swap is a recommendation that comes up because
> folks are afraid of their OSDs becoming sluggish when their hosts become
> oversubscribed.
> 
> But why not just avoid oversubscription altogether?
> 
> If you set appropriate OSD memory targets, set kernel swappiness to
> something like 10-20, and properly pin your OSDs in a system with >1 NUMA
> node so that they're evenly distributed across NUMA nodes, your kernel will
> not swap because it simply has no reason to.
> 
> Because we leave swap enabled, we actually found that we were giving up
> tons of performance -- after digging in when we saw swapping in some cases
> previously, we found that the NUMA page balancer in the kernel was
> shuffling pages around constantly before we had NUMA pinned the OSD
> processes. If we had just disabled swap, the OSDs would have still become
> sluggish and identifying why would have been a lot harder, because it's not
> enough for performance to tank... just start dropping off somewhat when
> pages started dancing between nodes.
> 
> Ever since we NUMA pinned our OSDs and set OSD memory targets
> appropriately, not a byte has been swapped to disk in over a year across a
> huge farm of OSDs (and they got noticeably faster, too).
> 
> Tyler