Re: Reef osd_memory_target and swapping

Tyler Stachecki <stachecki.tyler@xxxxxxxxx> · Wed, 16 Oct 2024 07:46:53 -0400

On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

>
>
> > On Oct 15, 2024, at 1:06 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
> >
> > Hello.
> >
> > I'm seeing the following in the Dashboard  -> Configuration panel
> > for osd_memory_target:
> >
> > Default:
> > 4294967296
> >
> > Current Values:
> > osd: 9797659437,
> > osd: 10408081664,
> > osd: 11381160192,
> > osd: 22260320563
> >
> > I have 4 hosts in the cluster right now - all OSD+MGR+MON.  3 have 128GB
> > RAM, the 4th has 256GB.
>
>
> https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory
>
> You have autotuning enabled, and it’s trying to use all of your physmem.
> I don’t know offhand how Ceph determines the amount of available memory, if
> it looks specifically for physmem or if it only looks at vmem.  If it looks
> at vmem that arguably could be a bug.
>
>
> >  On the host with 256GB, top shows some OSD
> > processes with very high VIRT and RES values - the highest VIRT OSD has
> > 13.0g.  The highest RES is 8.5g.
> >
> > All 4 systems are currently swapping, but the 256GB system has much higher
> > swap usage.
> >
> > I am confused why I have 4 current values for osd_memory_target, and
> > especially about the 4th one at 22GB.
> >
> > Also, I'm recalling that there might be a recommendation to disable swap,
> > and I could easily do 'swapoff -a' when the swap usage is lower than the
> > free RAM.
>
> I tend to advise not using swap at all.  Suggest disabling swap in fstab,
> then serially rebooting your OSD nodes, of course waiting for recovery
> between each before proceeding to the next.
>
> >
> > Can anybody shed any light on this?
> >
> > Thanks.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdhall@xxxxxxxxxxxxxx
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>

The swap recommendation is a contentious one - I, for one, have always been
against it. IMHO, disabling swap is a recommendation that comes up because
folks are afraid of their OSDs becoming sluggish when their hosts become
oversubscribed.

But why not just avoid oversubscription altogether?

If you set appropriate OSD memory targets, set kernel swappiness to
something like 10-20, and properly pin your OSDs in a system with >1 NUMA
node so that they're evenly distributed across NUMA nodes, your kernel will
not swap because it simply has no reason to.
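To make "appropriate OSD memory targets" concrete, here is a rough sketch of the arithmetic I have in mind. It mirrors what the cephadm autotuning docs linked above describe (take a fraction of host RAM, subtract what the non-OSD daemons need, divide by the number of OSDs), but the function name, the reservation for MON/MGR, and the OSD count are all made up for illustration. The 0.7 default is my understanding of the documented autotune ratio; pick your own.

```python
GIB = 1024 ** 3

def per_osd_memory_target(total_ram_bytes, num_osds,
                          other_daemons_bytes=0, ratio=0.7):
    """Return a per-OSD memory target (bytes) for a single host.

    Hypothetical helper, not part of Ceph. `ratio` is the fraction of
    host RAM handed to daemons in total; `other_daemons_bytes` is what
    you reserve for co-located MON/MGR/etc. (an assumption you choose).
    """
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    budget = int(total_ram_bytes * ratio) - other_daemons_bytes
    if budget <= 0:
        raise ValueError("nothing left for OSDs after reservations")
    # Integer division keeps the sum of all targets within the budget.
    return budget // num_osds

# Illustrative numbers only: a 128 GiB host, 8 OSDs, 8 GiB kept for MON+MGR.
target = per_osd_memory_target(128 * GIB, num_osds=8,
                               other_daemons_bytes=8 * GIB)
print(f"osd_memory_target per OSD: {target} bytes")
```

Since the result depends on each host's RAM and OSD count, hosts with different hardware end up with different targets. The point is just that the targets, summed per host, should leave real headroom below physical RAM.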

Because we leave swap enabled, we actually found that we were giving up
tons of performance. When we dug into some swapping we had seen
previously, we found that the NUMA page balancer in the kernel was
shuffling pages around constantly before we had NUMA pinned the OSD
processes. If we had just disabled swap, the OSDs would have still become
sluggish and identifying why would have been a lot harder, because it's not
enough for performance to tank... it just starts dropping off somewhat when
pages start dancing between nodes.
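
If you want to check for this kind of thing yourself, one option is to compare two snapshots of /proc/vmstat and look at how the swap and NUMA balancing counters moved. This is only a sketch, not the tooling we used: the helper names are hypothetical, and it assumes a Linux kernel that exposes pswpin, pswpout, numa_hint_faults and numa_pages_migrated (the NUMA ones are only there with automatic NUMA balancing built in).

```python
WATCHED = ("pswpin", "pswpout", "numa_hint_faults", "numa_pages_migrated")

def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def read_vmstat(path="/proc/vmstat"):
    """Take one snapshot of the kernel's VM counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())

def vmstat_delta(before, after, keys=WATCHED):
    """Return how much each watched counter grew between two snapshots.

    Counters missing from a snapshot are treated as 0.
    """
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}
```

Call read_vmstat() twice, some seconds apart, and pass both results to vmstat_delta(). Steadily growing numa_pages_migrated on a host with unpinned OSDs would be consistent with the page shuffling described above, and growing pswpout means the host really is swapping out.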

Ever since we NUMA pinned our OSDs and set OSD memory targets
appropriately, not a byte has been swapped to disk in over a year across a
huge farm of OSDs (and they got noticeably faster, too).

Tyler