> If you set appropriate OSD memory targets, set kernel swappiness to
> something like 10-20, and properly pin your OSDs in a system with >1 NUMA
> node so that they're evenly distributed across NUMA nodes, your kernel will
> not swap because it simply has no reason to.

Unfortunately, it's not quite that simple. At least up to mimic, and
potentially in later releases too, there was a behavior where either the
OSD's allocator did not release unused pages or the kernel did not reclaim
them as long as sufficient total memory was available, which led to
pointless swapping. The observation was exactly what Dave describes: huge
resident memory size without any load. The resident memory size just stayed
high for no apparent reason.

The consequences were bad, though: during peering the "leaked" memory
apparently started playing a role, and OSDs crashed because the pages out
on swap no longer fit into RAM. On mimic, disabling swap solved this issue,
and one can argue about whether or not this is a bug in the allocator code.

The recommendations by the kernel developers for swap on or off are - as
far as I remember - somewhat along the lines that if enough RAM is
available, the system doesn't really profit from having swap (and swap
should only be something like 1-4G anyway). The situation in which mapping
pages out to swap becomes useful should probably never occur on a properly
dimensioned OSD host.

Having said that, we do have large swap partitions on disk for emergency
cases. We have swap off by default to avoid the memory "leak" issue, and we
actually have sufficient RAM to begin with - maybe that's a bit of a
luxury.
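For completeness, keeping the emergency partition around while leaving swap
off amounts to something like the following (the device name is just a
placeholder for whatever your swap partition actually is):

    # Comment out the swap line in /etc/fstab so it is not activated at
    # boot, e.g.:
    #   #/dev/sdX2  none  swap  sw  0  0
    swapoff -a       # turn swap off now; only safe if free RAM > swap in use
    swapon --show    # verify no swap device is still active
    free -h          # sanity-check memory afterwards
    # In an emergency, re-enable it by hand:
    #   swapon /dev/sdX2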
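And for reference, the knobs in the recipe quoted above map to commands
roughly like these (the values are examples, not recommendations):

    # Give each OSD a fixed memory target in bytes; 4 GiB as an example:
    ceph config set osd osd_memory_target 4294967296
    # If cephadm's autotuner keeps overriding it, disable autotuning first:
    ceph config set osd osd_memory_target_autotune false
    # Make the kernel less eager to swap (persist via /etc/sysctl.d/):
    sysctl -w vm.swappiness=10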
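The NUMA pinning Tyler describes below can, as far as I know, be done with
the built-in support on Nautilus and newer; a rough sketch, with osd.3 and
node 0 as stand-ins:

    # Show which NUMA node each OSD and its network/storage devices sit on:
    ceph osd numa-status
    # Pin an OSD's CPU affinity to one node (takes effect after restart):
    ceph config set osd.3 osd_numa_node 0
    systemctl restart ceph-osd@3   # unit name varies with your deployment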
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
Sent: Wednesday, October 16, 2024 1:46 PM
To: Anthony D'Atri
Cc: Dave Hall; ceph-users
Subject: Re: Reef osd_memory_target and swapping

On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> > On Oct 15, 2024, at 1:06 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
> >
> > Hello.
> >
> > I'm seeing the following in the Dashboard -> Configuration panel
> > for osd_memory_target:
> >
> > Default:
> > 4294967296
> >
> > Current Values:
> > osd: 9797659437,
> > osd: 10408081664,
> > osd: 11381160192,
> > osd: 22260320563
> >
> > I have 4 hosts in the cluster right now - all OSD+MGR+MON. 3 have
> > 128GB RAM, the 4th has 256GB.
>
> https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory
>
> You have autotuning enabled, and it's trying to use all of your physmem.
> I don't know offhand how Ceph determines the amount of available memory,
> if it looks specifically for physmem or if it only looks at vmem. If it
> looks at vmem, that arguably could be a bug.
>
> > On the host with 256GB, top shows some OSD
> > processes with very high VIRT and RES values - the highest VIRT OSD
> > has 13.0g. The highest RES is 8.5g.
> >
> > All 4 systems are currently swapping, but the 256GB system has much
> > higher swap usage.
> >
> > I am confused why I have 4 current values for osd_memory_target, and
> > especially about the 4th one at 22GB.
> >
> > Also, I'm recalling that there might be a recommendation to disable
> > swap, and I could easily do 'swapoff -a' when the swap usage is lower
> > than the free RAM.
>
> I tend to advise not using swap at all. Suggest disabling swap in fstab,
> then serially rebooting your OSD nodes, of course waiting for recovery
> between each before proceeding to the next.
>
> > Can anybody shed any light on this?
> >
> > Thanks.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdhall@xxxxxxxxxxxxxx

The swap recommendation is a contentious one - I, for one, have always
been against it. IMHO, disabling swap is a recommendation that comes up
because folks are afraid of their OSDs becoming sluggish when their hosts
become oversubscribed. But why not just avoid oversubscription altogether?

If you set appropriate OSD memory targets, set kernel swappiness to
something like 10-20, and properly pin your OSDs in a system with >1 NUMA
node so that they're evenly distributed across NUMA nodes, your kernel
will not swap because it simply has no reason to.

Because we left swap enabled, we actually found that we were giving up
tons of performance: after digging in when we saw swapping in some cases
previously, we found that the kernel's NUMA page balancer was constantly
shuffling pages around before we had NUMA-pinned the OSD processes. If we
had just disabled swap, the OSDs would still have become sluggish and
identifying why would have been a lot harder, because performance didn't
outright tank... it just dropped off somewhat when pages started dancing
between nodes.

Ever since we NUMA-pinned our OSDs and set OSD memory targets
appropriately, not a byte has been swapped to disk in over a year across
a huge farm of OSDs (and they got noticeably faster, too).

Tyler
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx