Re: Reef osd_memory_target and swapping

Hello Dave,

Try this:

ceph config rm osd/host:ceph09 osd_memory_target

The document is here:
https://docs.ceph.com/en/reef/rados/configuration/ceph-conf/#sections-and-masks
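
If you want to clear all four host-masked entries and turn the autotuner off
in one go, a rough sketch (assuming the hosts are ceph00/01/02/09 as in your
dump below; the explicit target is just an example, adjust to taste):

ceph config rm osd/host:ceph00 osd_memory_target
ceph config rm osd/host:ceph01 osd_memory_target
ceph config rm osd/host:ceph02 osd_memory_target
ceph config rm osd/host:ceph09 osd_memory_target
ceph config set osd osd_memory_target_autotune false
ceph config set osd osd_memory_target 8589934592    # e.g. an explicit 8 GiB

Afterwards "ceph config dump | grep osd_memory_target" should show only the
global entry.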

On Wed, Oct 23, 2024 at 3:27 AM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>
> Anthony and all,
>
> At this point I would like to try to turn off osd_memory_target_autotune,
> reset my osd_memory_targets, and see how things work.  However, I have not
> figured out the incantation for removing (or even changing) the four
> separate osd_memory_target settings that I seem to have.
>
> # ceph config dump | grep osd_memory_target | cut -c 1-120
> osd    host:ceph00   basic     osd_memory_target             9795377735
> osd    host:ceph01   basic     osd_memory_target             10408081664
> osd    host:ceph02   basic     osd_memory_target             11381160192
> osd    host:ceph09   basic     osd_memory_target             22260320563
> osd                  advanced  osd_memory_target_autotune    true
>
>
> In the dashboard edit panel for osd_memory_target, only the fourth value is
> shown, so I guess managing this kind of multi-valued attribute via the
> dashboard is not possible.  Working with the CLI I have not found a way to
> get anything back except for the default value.
>
> I assume that there is a document somewhere that would explain the extended
> syntax for ceph config, but I haven't found it.
>
> Thanks.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
> kdhall@xxxxxxxxxxxxxx
>
> On Wed, Oct 16, 2024 at 12:02 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> >
> > > Unfortunately, it's not quite that simple.  At least through Mimic, and
> > > potentially in later releases too, there was this behavior where either
> > > the OSD's allocator did not release unused pages or the kernel did not
> > > reclaim them if there was sufficient total memory available, which
> > > implied pointless swapping.  The observation was exactly what Dave
> > > describes: huge resident memory size without any load.  The resident
> > > memory size just stayed high for no apparent reason.
> >
> > I’ve seen that on non-Ceph systems too.  Sometimes with Ceph I see
> > tcmalloc not actually freeing unused mem; in those situations a “heap
> > release” on the admin socket does wonders.  I haven’t seen that since …
> > Nautilus perhaps.
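> >
> > A quick sketch of that, with osd.0 as an arbitrary example (the same
> > "heap" subcommands also work through "ceph daemon" on the host itself):
> >
> > ceph tell osd.0 heap stats
> > ceph tell osd.0 heap release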
> >
> > > The consequences were bad though, because during peering apparently the
> > > "leaked memory" started playing a role and OSDs crashed due to pages on
> > > swap not fitting into RAM.
> >
> > Back in the BSD days swap had to be >= physmem; these days we skate SysV
> > style, where swap extends the VM space instead of backing it.
> >
> > > Having said that, we do have large swap partitions on disk for emergency
> > > cases.  We have swap off by default to avoid the memory "leak" issue and
> > > we actually have sufficient RAM to begin with - maybe that's a bit of a
> > > luxury.
> >
> > I can't argue with that strategy, if your boot drives are large enough.
> > As recently as this year I've suffered legacy systems with as little as
> > 100GB boot drives, so overly balkanized that no partition was large enough.
> >
> > K8s as I understand it won’t even run if swap is enabled.  Swap to me is
> > what we did in 1988 when RAM cost money and we had 3MB (yes) diskless (yes)
> > workstations.  Out of necessity.
> >
> > > The swap recommendation is a contentious one - I, for one, have always
> > > been against it.
> >
> > Same here.  It’s a relic of the days when RAM was dramatically more
> > expensive.  I’ve had this argument with people stuck in the past, even when
> > the resident performance expert 100% agreed with me.
> >
> > > IMHO, disabling swap is a recommendation that comes up because folks are
> > > afraid of their OSDs becoming sluggish when their hosts become
> > > oversubscribed.
> >
> > In part yes.  I tell people all the time that Ceph is usually better off
> > with a failed component than a crippled component.
> >
> > > But why not just avoid oversubscription altogether?
> >
> > Well, yeah.  In the above case, with non-Ceph systems, there were like
> > 2000 of them at unstaffed DCs around the world that were Dell R430s with
> > only 64GB.  It was a closely-guarded secret that deployments were
> > blue-green, so enough vmem was needed to run two copies for brief
> > intervals.  Upgrading them would have been prohibitively expensive, even if
> > they weren't already like 8 years old.  Plus certain people were stubborn.
> >
> >
> > > If you set appropriate OSD memory targets, set kernel swappiness to
> > > something like 10-20, and properly pin your OSDs in a system with >1 NUMA
> > > node so that they're evenly distributed across NUMA nodes, your kernel
> > > will not swap because it simply has no reason to.
> >
> > I had swappiness arguments with the above people too, and had lobbied for
> > the refresh nodes (again, non-Ceph) to be single-socket to avoid the NUMA
> > factor that demonstrably was degrading performance.
> >
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
> > > Sent: Wednesday, October 16, 2024 1:46 PM
> > > To: Anthony D'Atri
> > > Cc: Dave Hall; ceph-users
> > > Subject:  Re: Reef osd_memory_target and swapping
> > >
> > > On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> > >
> > >>
> > >>
> > >>> On Oct 15, 2024, at 1:06 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
> > >>>
> > >>> Hello.
> > >>>
> > >>> I'm seeing the following in the Dashboard  -> Configuration panel
> > >>> for osd_memory_target:
> > >>>
> > >>> Default:
> > >>> 4294967296
> > >>>
> > >>> Current Values:
> > >>> osd: 9797659437,
> > >>> osd: 10408081664,
> > >>> osd: 11381160192,
> > >>> osd: 22260320563
> > >>>
> > >>> I have 4 hosts in the cluster right now - all OSD+MGR+MON.  3 have
> > >>> 128GB RAM, the 4th has 256GB.
> > >>
> > >>
> > >>
> > >> https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory
> > >>
> > >> You have autotuning enabled, and it's trying to use all of your physmem.
> > >> I don't know offhand how Ceph determines the amount of available memory,
> > >> if it looks specifically for physmem or if it only looks at vmem.  If it
> > >> looks at vmem, that arguably could be a bug.
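> > >>
> > >> That said, per the cephadm doc above, I believe the autotuner takes the
> > >> host's total memory, multiplies it by mgr/cephadm/autotune_memory_target_ratio
> > >> (0.7 by default), subtracts space for non-OSD daemons, and splits the
> > >> rest across the OSDs on that host; that would roughly explain the ~22G
> > >> value on your 256GB box.  If you keep autotuning, something like this
> > >> (ratio purely illustrative) reins it in:
> > >>
> > >> ceph config set mgr mgr/cephadm/autotune_memory_target_ratio 0.5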
> > >>
> > >>
> > >>> On the host with 256GB, top shows some OSD
> > >>> processes with very high VIRT and RES values - the highest VIRT OSD has
> > >>> 13.0g.  The highest RES is 8.5g.
> > >>>
> > >>> All 4 systems are currently swapping, but the 256GB system has much
> > >>> higher swap usage.
> > >>>
> > >>> I am confused why I have 4 current values for osd_memory_target, and
> > >>> especially about the 4th one at 22GB.
> > >>>
> > >>> Also, I'm recalling that there might be a recommendation to disable
> > >>> swap, and I could easily do 'swapoff -a' when the swap usage is lower
> > >>> than the free RAM.
> > >>
> > >> I tend to advise not using swap at all.  I suggest disabling swap in
> > >> fstab, then serially rebooting your OSD nodes, of course waiting for
> > >> recovery between each before proceeding to the next.
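> > >>
> > >> Per node, something like this (a sketch; the fstab edit is whatever fits
> > >> your layout):
> > >>
> > >> ceph osd set noout        # keep CRUSH from rebalancing during the reboot
> > >> # edit /etc/fstab by hand and comment out the swap line
> > >> swapoff -a
> > >> reboot
> > >> # ...once the node is back and the cluster is HEALTH_OK again:
> > >> ceph osd unset noout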
> > >>
> > >>>
> > >>> Can anybody shed any light on this?
> > >>>
> > >>> Thanks.
> > >>>
> > >>> -Dave
> > >>>
> > >>> --
> > >>> Dave Hall
> > >>> Binghamton University
> > >>> kdhall@xxxxxxxxxxxxxx
> > >
> > >
> > > The swap recommendation is a contentious one - I, for one, have always
> > > been against it. IMHO, disabling swap is a recommendation that comes up
> > > because folks are afraid of their OSDs becoming sluggish when their hosts
> > > become oversubscribed.
> > >
> > > But why not just avoid oversubscription altogether?
> > >
> > > If you set appropriate OSD memory targets, set kernel swappiness to
> > > something like 10-20, and properly pin your OSDs in a system with >1 NUMA
> > > node so that they're evenly distributed across NUMA nodes, your kernel
> > > will not swap because it simply has no reason to.
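> > >
> > > For example (values and the OSD id are purely illustrative; osd_numa_node
> > > is per-OSD and assumes you know which node each OSD's devices sit on):
> > >
> > > sysctl vm.swappiness=10
> > > ceph config set osd.12 osd_numa_node 0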
> > >
> > > Because we leave swap enabled, we actually found that we were giving up
> > > tons of performance -- after digging in when we saw swapping in some
> > > cases previously, we found that the NUMA page balancer in the kernel was
> > > shuffling pages around constantly before we had NUMA pinned the OSD
> > > processes. If we had just disabled swap, the OSDs would have still become
> > > sluggish and identifying why would have been a lot harder, because
> > > performance doesn't tank outright... it just starts dropping off somewhat
> > > when pages start dancing between nodes.
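> > >
> > > For anyone who wants to check this on their own boxes (commands purely
> > > illustrative): "sysctl kernel.numa_balancing" shows whether the kernel's
> > > automatic NUMA balancer is on, and "numastat -p <ceph-osd pid>" shows how
> > > an OSD's pages are spread across NUMA nodes.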
> > >
> > > Ever since we NUMA pinned our OSDs and set OSD memory targets
> > > appropriately, not a byte has been swapped to disk in over a year across
> > > a huge farm of OSDs (and they got noticeably faster, too).
> > >
> > > Tyler
> >
> >



-- 
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



