Hi Stefan,

Thanks for your feedback!

On Thu, Sep 29, 2022 at 10:28 AM Stefan Kooman <stefan@xxxxxx> wrote:

> On 9/26/22 18:04, Gauvain Pocentek wrote:
>
> > We are running a Ceph Octopus (15.2.16) cluster with a similar
> > configuration. We have *a lot* of slow ops when starting OSDs, and
> > also during peering. When the OSDs start they consume 100% CPU for
> > up to ~10 seconds, and after that consume 200% for a minute or
> > more. During that time the OSDs perform a compaction; you should be
> > able to find this in the OSD logs if it's the same in your case.
> > After some time the OSDs are done initializing and start the boot
> > process. As soon as they boot up and start peering, the slow ops
> > kick in. Lots of "transitioning to Primary" and "transitioning to
> > Stray" logging. Some time later the OSD becomes "active". While the
> > OSD is busy with peering it is also busy compacting, as I also see
> > RocksDB compaction logging. So it might be due to RocksDB
> > compactions impacting OSD performance while it's already busy
> > becoming primary (and/or secondary / tertiary) for its PGs.
> >
> > We had norecover, nobackfill and norebalance active when booting
> > the OSDs.
> >
> > So, it might just take a long time to do RocksDB compaction. In
> > this case it might be better to do all the needed RocksDB
> > compactions first, and then start booting. So, what might help is
> > to set "ceph osd set noup". This prevents the OSDs from becoming
> > active; then wait for the RocksDB compactions, and after that unset
> > the flag.
> >
> > If you try this, please let me know how it goes.
>
> Last night we had storage switch maintenance. We turned off 2/3 of
> the cluster and back on (one failure domain at a time). We used the
> "noup" flag to prevent the OSDs from booting and waited for ~10
> minutes; that was the time it took for the last OSD to finish its
> RocksDB compactions. At that point we unset the "noup" flag and
> almost all OSDs came back online instantly. This resulted in some
> slow ops, but ~30 times fewer than before, and only for ~5 seconds.
> With a bit more planning you can set the "noup" flag on individual
> OSDs and then, in a loop with some sleep, unset it per OSD. This
> might give less stress during peering. It is however
> micro-management; ideally this "noup" step should not be needed at
> all. The (maybe naive) solution would be to have the OSD refrain
> from becoming active while it's in the boot-up phase and busy going
> through a whole batch of RocksDB compaction events. I'm CC-ing Igor
> to see if he can comment on this.
>
> @Gauvain: Compared to your other clusters, does this cluster have
> Ceph services running that the others don't? Your other clusters
> might have *way* less OMAP/metadata than the cluster giving you
> issues.

This cluster runs the same services as the other clusters.

It looks like we are hitting this bug:
https://tracker.ceph.com/issues/53729. There seem to be a lot of
duplicated op log entries (I'm still trying to understand what these
really are), huge memory usage (which hasn't been a problem for us
because our servers have a lot of RAM), and so far no way to clean
this up online with Pacific. This blog post explains very clearly how
to check if you are impacted:
https://www.clyso.com/blog/osds-with-unlimited-ram-growth/

All our clusters seem to be impacted, but that specific one shows
worse signs. We are now looking into the offline cleanup.
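For anyone else following the thread: as far as I understand it, the
check from the blog post boils down to dumping the PG log of a stopped
OSD with ceph-objectstore-tool and counting the dup entries. Roughly
something like the following; the OSD data path and PG id are just
examples, and the jq field path is from memory, so it may differ
slightly between releases:

  # with the OSD stopped
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 \
      --pgid 2.7ff --op log > /tmp/pg-2.7ff-log.json
  # count the dup entries in the dumped PG log
  jq '.pg_log_t.dups | length' /tmp/pg-2.7ff-log.json

If I remember correctly a healthy PG should stay in the low thousands
(osd_pg_log_dups_tracked defaults to 3000), while PGs hit by the bug
reportedly accumulate millions of dup entries. A quick online hint is
a very large osd_pglog mempool in "ceph tell osd.N dump_mempools".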
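And thanks again for the noup tip. For the per-OSD variant you
describe, I imagine a small loop along these lines would do (untested
sketch, the OSD ids and the sleep are placeholders; I believe
set-group/unset-group is the right tool for per-OSD flags):

  # keep these OSDs from being marked up while they compact
  for id in 12 13 14; do
      ceph osd set-group noup osd.$id
  done

  # restart the OSDs, wait for the RocksDB compactions to finish,
  # then let them join the cluster one at a time
  for id in 12 13 14; do
      ceph osd unset-group noup osd.$id
      sleep 60
  done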
We're taking a lot of precautions because this is a production cluster
and the problems have already impacted users.

Gauvain