Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

Thank you for your responses!

Since yesterday we found that several OSD pods still had memory limits set,
and in fact some of them (but far from all) were getting OOM killed, so we
have fully removed those limits again.  Unfortunately this hasn't helped
much and there are still 50ish OSDs down.  We're now experimenting on one
of the down OSDs with "ceph-bluestore-tool --command fsck --deep", followed
by "ceph-objectstore-tool --op fsck".  Those finished successfully but it
hasn't resulted in any fixes.
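
For reference, this is roughly what we're running (mirroring the commands
above, from inside the stopped OSD's pod; <osd data path> is a placeholder
for wherever Rook mounts that OSD's data directory):

$ ceph-bluestore-tool --command fsck --deep --path <osd data path>
$ ceph-objectstore-tool --data-path <osd data path> --op fsck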

We've noticed messages like these in dmesg, but don't know what to make of
them yet.  Could these be indicative of a problem, or are they part of
normal operation?

----begin paste----
[Tue Jan 25 19:39:08 2022] libceph: wrong peer, want (1)10.6.168.17:6825/-1988778847, got (1)0.0.0.0:6825/1335335775
[Tue Jan 25 19:39:08 2022] libceph: osd32 (1)10.6.168.17:6825 wrong peer at address
[Tue Jan 25 19:39:09 2022] libceph: wrong peer, want (1)10.6.168.17:6825/-1988778847, got (1)10.6.168.17:6825/1335335775
[Tue Jan 25 19:39:09 2022] libceph: osd32 (1)10.6.168.17:6825 wrong peer at address
[Tue Jan 25 19:39:11 2022] libceph: wrong peer, want (1)10.6.168.17:6825/-1988778847, got (1)10.6.168.17:6825/1335335775
[Tue Jan 25 19:39:11 2022] libceph: osd32 (1)10.6.168.17:6825 wrong peer at address
[Tue Jan 25 19:39:15 2022] libceph: wrong peer, want (1)10.6.168.17:6825/-1988778847, got (1)10.6.168.17:6825/1335335775
[Tue Jan 25 19:39:15 2022] libceph: osd32 (1)10.6.168.17:6825 wrong peer at address
[Tue Jan 25 20:04:58 2022] libceph: wrong peer, want (1)10.6.168.17:6809/779850463, got (1)0.0.0.0:6809/-398783580
[Tue Jan 25 20:04:58 2022] libceph: osd34 (1)10.6.168.17:6809 wrong peer at address
[Tue Jan 25 20:04:59 2022] libceph: wrong peer, want (1)10.6.168.17:6809/779850463, got (1)10.6.168.17:6809/-398783580
[Tue Jan 25 20:04:59 2022] libceph: osd34 (1)10.6.168.17:6809 wrong peer at address
[Tue Jan 25 20:32:49 2022] libceph: wrong peer, want (1)10.6.168.11:6833/-483515092, got (1)0.0.0.0:6833/1446518624
[Tue Jan 25 20:32:49 2022] libceph: osd74 (1)10.6.168.11:6833 wrong peer at address
[Tue Jan 25 20:32:50 2022] libceph: wrong peer, want (1)10.6.168.11:6833/-483515092, got (1)10.6.168.11:6833/1446518624
[Tue Jan 25 20:32:50 2022] libceph: osd74 (1)10.6.168.11:6833 wrong peer at address
----end paste----
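
(In case it's useful for interpreting these: the address and nonce in each
message can be compared against what the OSD map currently advertises, e.g.
for the osd32 lines above:

$ ceph osd dump | grep '10.6.168.17:6825'

though we're not sure yet whether the mismatch is meaningful.)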

(also see inline replies below)

On Tue, Jan 25, 2022 at 10:51 AM Dan van der Ster <dvanders@xxxxxxxxx>
wrote:

> On Tue, Jan 25, 2022 at 4:07 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi Dan,
> >
> > in several threads I have now seen statements like "Does your cluster
> have the pglog_hardlimit set?". In this context, I would be grateful if you
> could shed some light on the following:
> >
> > 1) How do I check that?
> >
> > There is no equivalent "osd get pglog_hardlimit".
>
> I showed how to query for it:
>
> # ceph osd dump | grep pglog
> flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit


Yes, pglog_hardlimit is enabled:

$ ceph osd dump|grep pglog
flags norebalance,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit


> > 2) What is the recommendation?
>
> Since pacific it should be on by default, but I haven't had any user
> confirm this fact.
> (On our clusters we have enabled it manually when it was added to
> nautilus).
>
> > In the ceph documentation, the only occurrence of the term
> pglog_hardlimit are release notes for luminous and mimic, stating (mimic)
> >
> > > A flag called pglog_hardlimit has been introduced, which is off by
> default. Enabling this flag will limit the
> > > length of the pg log. In order to enable that, the flag must be set by
> running ceph osd set pglog_hardlimit
> > > after completely upgrading to 13.2.2. Once the cluster has this flag
> set, the length of the pg log will be
> > > capped by a hard limit. Once set, this flag must not be unset anymore.
> In luminous, this feature was
> > > introduced in 12.2.11. Users who are running 12.2.11, and want to
> continue to use this feature, should
> > > upgrade to 13.2.5 or later.
> >
> > How do I know if I want to use this feature? I would need a bit of
> information about pros and cons. Or should one have this enabled in any
> case? Would be great if you could provide some insight here.
>
> Normally a pg log with even 10000 entries consumes just a couple
> hundred MBs of memory. (See the osd_pglog mempool).
> The pg log length can be queried like I showed earlier:
>
> # ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
>
> (those are the LOG columns in the pg output).
>

It seems we don't have any above 2800 entries:

$ ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
dumped all
2682 2682 undersized+degraded+peered
2683 2683 down
2685 2685 down
2704 2704 down
2704 2704 down
2710 2710 active+undersized+degraded
2714 2714 down
2726 2726 active+undersized+degraded
2735 2735 undersized+degraded+peered
2737 2737 down
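
If it would help, we can also grab the osd_pglog mempool numbers you
mentioned from one of the OSDs that is still up and report back, along the
lines of (run inside that OSD's pod; osd.<id> is a placeholder):

$ ceph daemon osd.<id> dump_mempools | grep -A 4 osd_pglog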


> In the past I've seen pg logs with millions of entries. Those are
> surely a root cause for huge memory usage, especially at OSD boot
> time.
> Such pglogs would need to be trimmed, e.g. with the
> ceph-objectstore-tool recipes that have been shared around on the
> list.
> The pglog_hardlimit is meant to limit the growth of the PG log.
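
(Noting for later: if trimming does turn out to be necessary here, the
recipe we've seen referenced looks roughly like

$ ceph-objectstore-tool --data-path <osd data path> --pgid <pgid> --op trim-pg-log

run per PG with the OSD stopped, with placeholders for the data path and
pgid.  Please correct me if that's not the right approach.)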
>
> On the other hand: it is clear that even with reasonably sized PG
> logs, the memory can balloon for some unknown reason.
> The devs have asked a couple of times for dumps of the pglogs that cause
> huge memory use when replayed.
>
> In this case -- Benjamin's issue -- I'm trying to understand if this
> is related to:
> * a huge pg log -- would need trimming -- perhaps the pglog_hardlimit
> isn't on by default as designed
> * normal sized pg log, with some entries that are consuming huge
> amounts of memory (due to a yet-unsolved bug).
>
> Thanks,
> Dan
>
>
> > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > Sent: 25 January 2022 11:56:38
> > To: Benjamin Staffin
> > Cc: Ceph Users; Matthew Wilder; Tara Fly
> > Subject:  Re: Fwd: Lots of OSDs crashlooping (DRAFT -
> feedback?)
> >
> > Hi Benjamin,
> >
> > Apologies that I can't help for the bluestore issue.
> >
> > But that huge 100GB OSD consumption could be related to similar
> > reports linked here: https://tracker.ceph.com/issues/53729
> >
> > Does your cluster have the pglog_hardlimit set?
> >
> > # ceph osd dump | grep pglog
> > flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
> >
> > Do you have PGs with really long pglogs?
> >
> > # ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
> >
> >
> >
> > -- Dan
> >
> > On Tue, Jan 25, 2022 at 12:44 AM Benjamin Staffin
> > <bstaffin@xxxxxxxxxxxxxxx> wrote:
> > >
> > > I have a cluster where 46 out of 120 OSDs have begun crash looping
> with the
> > > same stack trace (see pasted output below).  The cluster is in a very
> bad
> > > state with this many OSDs down, unsurprisingly.
> > >
> > > The day before this problem showed up, the k8s cluster was under
> extreme
> > > memory pressure and a lot of pods were OOM killed, including some of
> the
> > > Ceph OSDs, but after the memory pressure abated everything seemed to
> > > stabilize for about a day.
> > >
> > > Then we attempted to set a 4gb memory limit on the OSD pods, because
> they
> > > had been using upwards of 100gb of ram(!) per OSD after about a month
> of
> > > uptime, and this was a contributing factor in the cluster-wide OOM
> > > situation.  Everything seemed fine for a few minutes after Rook rolled
> out
> > > the memory limit, but then OSDs gradually started to crash, a few at a
> > > time, up to about 30 of them.  At this point I reverted the memory
> limit,
> > > but I don't think the OSDs were hitting their memory limits at all.
> In an
> > > attempt to stabilize the cluster, we eventually stopped the Rook operator and
> set
> > > the osd norebalance, nobackfill, noout, and norecover flags, but at
> this
> > > point there were 46 OSDs down and pools were hitting BackFillFull.
> > >
> > > This is a Rook-ceph deployment on bare-metal kubernetes cluster of 12
> > > nodes.  Each node has two 7TiB nvme disks dedicated to Ceph, and we
> have 5
> > > BlueStore OSDs per nvme disk (so around 1.4TiB per OSD, which ought to
> be
> > > fine with a 4gb memory target, right?).  The crash we're seeing looks
> very
> > > much like the one in this bug report:
> https://tracker.ceph.com/issues/52220
> > >
> > > I don't know how to proceed from here, so any advice would be very much
> > > appreciated.
> > >
> > > Ceph version: 16.2.6
> > > Rook version: 1.7.6
> > > Kubernetes version: 1.21.5
> > > Kernel version: 5.4.156-1.el7.elrepo.x86_64
> > > Distro: CentOS 7.9
> > >
> > > I've also attached the full log output from one of the crashing OSDs,
> in
> > > case that is of any use.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


