Anthony and all,

At this point I would like to try to turn off osd_memory_target_autotune,
reset my osd_memory_targets, and see how things work. However, I have not
figured out the incantation for removing (or even changing) the four
separate osd_memory_target settings that I seem to have.

# ceph config dump | grep osd_memory_target | cut -c 1-120
osd  host:ceph00  basic     osd_memory_target           9795377735
osd  host:ceph01  basic     osd_memory_target           10408081664
osd  host:ceph02  basic     osd_memory_target           11381160192
osd  host:ceph09  basic     osd_memory_target           22260320563
osd               advanced  osd_memory_target_autotune  true

In the dashboard edit panel for osd_memory_target, only the fourth value is
shown, so I guess managing this kind of multi-valued attribute via the
dashboard is not possible. Working with the CLI, I have not found a way to
get anything back except for the default value. I assume that there is a
document somewhere that explains the extended syntax for ceph config, but I
haven't found it.
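For what it's worth, my untested guess is something along the lines of the
commands below. The osd/host:<name> mask form, and whether "ceph config rm"
accepts it, are exactly the parts I have not been able to confirm, and the
6442450944 value is only a placeholder.

Turn autotuning off first, so the orchestrator doesn't just put the
host-masked values back:

# ceph config set osd osd_memory_target_autotune false

Remove the four host-masked entries:

# ceph config rm osd/host:ceph00 osd_memory_target
# ceph config rm osd/host:ceph01 osd_memory_target
# ceph config rm osd/host:ceph02 osd_memory_target
# ceph config rm osd/host:ceph09 osd_memory_target

Optionally set one explicit target for all OSDs (6442450944 = 6 GiB, purely
as an example), then check what an individual OSD actually resolves to:

# ceph config set osd osd_memory_target 6442450944
# ceph config get osd.0 osd_memory_target

If someone can confirm or correct that syntax, or point me at the right
document, I would appreciate it.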
Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx


On Wed, Oct 16, 2024 at 12:02 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

>
> Unfortunately, it's not quite that simple. At least until Mimic, but
> potentially later too, there was this behavior that either the OSD's
> allocator did not release or the kernel did not reclaim unused pages if
> there was sufficient total memory available. Which implied pointless
> swapping. The observation was exactly what Dave describes: huge resident
> memory size without any load. The resident memory size just stayed high
> for no apparent reason.
>
> I’ve seen that on non-Ceph systems too. Sometimes with Ceph I see
> tcmalloc not actually freeing unused mem; in those situations a “heap
> release” on the admin socket does wonders. I haven’t seen that since …
> Nautilus perhaps.
>
>
> The consequences were bad though, because during peering apparently the
> "leaked memory" started playing a role and OSDs crashed due to pages on
> swap not fitting into RAM.
>
> Back in the BSD days swap had to be >= physmem; these days we skate SysV
> style, where swap extends the VM space instead of backing it.
>
>
> Having said that, we do have large swap partitions on disk for emergency
> cases. We have swap off by default to avoid the memory "leak" issue and
> we actually have sufficient RAM to begin with - maybe that's a bit of a
> luxury.
>
> I can’t argue with that strategy, if your boot drives are large enough.
> I’ve as recently as this year suffered legacy systems with as little as
> 100GB boot drives — so overly balkanized that no partition was large
> enough.
>
> K8s as I understand it won’t even run if swap is enabled. Swap to me is
> what we did in 1988, when RAM cost money and we had 3MB (yes) diskless
> (yes) workstations. Out of necessity.
>
>
> The swap recommendation is a contentious one - I, for one, have always
> been against it.
>
> Same here. It’s a relic of the days when RAM was dramatically more
> expensive. I’ve had this argument with people stuck in the past, even
> when the resident performance expert 100% agreed with me.
>
> >IMHO, disabling swap is a recommendation that comes up because folks
> >are afraid of their OSDs becoming sluggish when their hosts become
> >oversubscribed.
>
> In part yes. I tell people all the time that Ceph is usually better off
> with a failed component than a crippled component.
>
> >But why not just avoid oversubscription altogether?
>
> Well, yeah. In the above case, with non-Ceph systems, there were like
> 2000 of them at unstaffed DCs around the world that were Dell R430s with
> only 64GB. There was a closely-guarded secret that deployments were
> blue-green, so enough vmem was needed to run two copies for brief
> intervals. Upgrading them would have been prohibitively expensive, even
> if they weren’t already like 8 years old. Plus certain people were
> stubborn.
>
> > If you set appropriate OSD memory targets, set kernel swappiness to
> > something like 10-20, and properly pin your OSDs in a system with >1
> > NUMA node so that they're evenly distributed across NUMA nodes, your
> > kernel will not swap because it simply has no reason to.
>
> I had swappiness arguments with the above people too, and had lobbied
> for the refresh nodes (again, non-Ceph) to be single-socket, to avoid
> the NUMA factor that demonstrably was degrading performance.
>
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
> > Sent: Wednesday, October 16, 2024 1:46 PM
> > To: Anthony D'Atri
> > Cc: Dave Hall; ceph-users
> > Subject: Re: Reef osd_memory_target and swapping
> >
> > On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> >
> >>
> >>> On Oct 15, 2024, at 1:06 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
> >>>
> >>> Hello.
> >>>
> >>> I'm seeing the following in the Dashboard -> Configuration panel
> >>> for osd_memory_target:
> >>>
> >>> Default:
> >>> 4294967296
> >>>
> >>> Current Values:
> >>> osd: 9797659437,
> >>> osd: 10408081664,
> >>> osd: 11381160192,
> >>> osd: 22260320563
> >>>
> >>> I have 4 hosts in the cluster right now - all OSD+MGR+MON. 3 have
> >>> 128GB RAM, the 4th has 256GB.
> >>
> >> https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory
> >>
> >> You have autotuning enabled, and it’s trying to use all of your
> >> physmem. I don’t know offhand how Ceph determines the amount of
> >> available memory, if it looks specifically for physmem or if it only
> >> looks at vmem. If it looks at vmem that arguably could be a bug.
> >>
> >>> On the host with 256GB, top shows some OSD
> >>> processes with very high VIRT and RES values - the highest VIRT OSD
> >>> has 13.0g. The highest RES is 8.5g.
> >>>
> >>> All 4 systems are currently swapping, but the 256GB system has much
> >>> higher swap usage.
> >>>
> >>> I am confused why I have 4 current values for osd_memory_target, and
> >>> especially about the 4th one at 22GB.
> >>>
> >>> Also, I'm recalling that there might be a recommendation to disable
> >>> swap, and I could easily do 'swapoff -a' when the swap usage is lower
> >>> than the free RAM.
> >>
> >> I tend to advise not using swap at all. Suggest disabling swap in
> >> fstab, then serially rebooting your OSD nodes, of course waiting for
> >> recovery between each before proceeding to the next.
> >>
> >>> Can anybody shed any light on this?
> >>>
> >>> Thanks.
> >>>
> >>> -Dave
> >>>
> >>> --
> >>> Dave Hall
> >>> Binghamton University
> >>> kdhall@xxxxxxxxxxxxxx
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >
> > The swap recommendation is a contentious one - I, for one, have always
> > been against it. IMHO, disabling swap is a recommendation that comes up
> > because folks are afraid of their OSDs becoming sluggish when their
> > hosts become oversubscribed.
> >
> > But why not just avoid oversubscription altogether?
> >
> > If you set appropriate OSD memory targets, set kernel swappiness to
> > something like 10-20, and properly pin your OSDs in a system with >1
> > NUMA node so that they're evenly distributed across NUMA nodes, your
> > kernel will not swap because it simply has no reason to.
> >
> > Because we leave swap enabled, we actually found that we were giving up
> > tons of performance -- after digging in when we saw swapping in some
> > cases previously, we found that the NUMA page balancer in the kernel
> > was shuffling pages around constantly before we had NUMA-pinned the OSD
> > processes. If we had just disabled swap, the OSDs would have still
> > become sluggish and identifying why would have been a lot harder,
> > because it's not that performance tanked... it just started dropping
> > off somewhat when pages started dancing between nodes.
> >
> > Ever since we NUMA-pinned our OSDs and set OSD memory targets
> > appropriately, not a byte has been swapped to disk in over a year
> > across a huge farm of OSDs (and they got noticeably faster, too).
> >
> > Tyler
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx