Hi,
On 9/21/22 18:00, Gauvain Pocentek wrote:
Hello all,
We are running several Ceph clusters and are facing an issue on one of
them, we would appreciate some input on the problems we're seeing.
We run Ceph in containers on Centos Stream 8, and we deploy using
ceph-ansible. While upgrading ceph from 16.2.7 to 16.2.10, we noticed that
OSDs were taking a very long time to restart on one of the clusters. (Other
clusters were not impacted at all.)
Are the other clusters of similar size?
The OSD startup was sometimes so slow
that we ended up having slow ops, with 1 or 2 PGs stuck in a peering state.
We've interrupted the upgrade and the cluster runs fine now, although we
have seen 1 OSD flapping recently, having trouble coming back to life.
We've checked a lot of things and read a lot of mails from this list, and
here is some info:
* this cluster has RBD pools for OpenStack and RGW pools; everything is
replicated x 3, except the RGW data pool which is EC 4+2
* we haven't found any hardware related issues; we run fully on SSDs and
they are all in good shape, no network issue, RAM and CPU are available on
all OSD hosts
* bluestore with an LVM collocated setup
* we have seen the slow restart with almost all the OSDs we've upgraded
(100 out of 350)
* on restart the ceph-osd process runs at 100% CPU but we haven't seen
anything weird on the host
Are the containers restricted to use a certain amount of CPU? Do the
OSDs, after ~ 10-20 seconds, increase their CPU usage to 200%? (If so,
this is probably because of the RocksDB option
max_background_compactions = 2.)
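A quick way to check both (a sketch; "podman" and the option name are
assumptions, adjust to your deployment):

    # CPU quota/usage of the OSD containers (docker instead of podman if applicable)
    podman stats --no-stream
    # RocksDB options Ceph hands to BlueStore, incl. max_background_compactions
    ceph config get osd bluestore_rocksdb_options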
* no DB spillover
* we have other clusters with the same hardware, and we don't see problems
there
The only thing we found that looks suspicious is the number of op log
entries for the PGs of the RGW index pool: `osd_max_pg_log_entries` is
set to 10k, but `ceph pg dump` shows PGs with more than 100k log entries
(the largest one has > 400k).
Could this be the reason for the slow startup of OSDs? If so, is there a
way to trim these logs without too much impact on the cluster?
Not sure. We have ~ 2K logs per PG.
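If you want to compare, this is roughly how the per-PG log counts can be
pulled out of ceph pg dump (a sketch; the column layout differs a bit
between releases, so check which column is LOG against the header first):

    # ten PGs with the largest pg_log (LOG was the 10th column here)
    ceph pg dump pgs 2>/dev/null | awk 'NR>1 {print $1, $10}' | sort -k2 -n | tail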
Let me know if additional info or logs are needed.
Do you have a log of slow ops and osd logs?
Do you have any non-standard configuration for the daemons? I.e. the
output of "ceph daemon osd.$id config diff".
We are running a Ceph Octopus (15.2.16) cluster with a similar
configuration. We have *a lot* of slow ops when starting OSDs, and also
during peering. When the OSDs start they consume 100% CPU for up to
~ 10 seconds, and after that 200% for a minute or more. During that time
the OSDs perform a compaction; you should be able to find this in the
OSD logs if it's the same in your case. After some time the OSDs are
done initializing and start the boot process. As soon as they boot up
and start peering, the slow ops kick in. Lots of "transitioning to
Primary" and "transitioning to Stray" logging. Some time later the OSD
becomes "active". While the OSD is busy with peering it's also busy
compacting, as I also see RocksDB compaction logging. So it might be
that RocksDB compactions are impacting OSD performance while it's
already busy becoming primary (and/or secondary / tertiary) for its PGs.
We had norecover, nobackfill and norebalance active when booting the
OSDs. So, it might just take a long time to do the RocksDB compaction.
In that case it might be better to do all the needed RocksDB compactions
first, and only then start booting. What might help is to set the "noup"
flag ("ceph osd set noup"): this prevents the OSDs from becoming active;
then wait for the RocksDB compactions to finish, and after that unset
the flag.
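Roughly, the sequence would look like this (a sketch, not tested against
your setup; osd.<id> is a placeholder, the systemd unit name comes from
our deployments and may differ, and "ceph daemon" needs the OSD's admin
socket, so run it on the OSD host / inside the container):

    ceph osd set noup                # restarted OSDs stay marked down, no peering yet
    systemctl restart ceph-osd@<id>  # restart / upgrade the OSD
    ceph daemon osd.<id> compact     # or wait for the automatic compactions to finish
    ceph osd unset noup              # now let the OSDs come up and start peering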
If you try this, please let me know how it goes.
Gr. Stefan