Hi,
On 9/21/22 18:00, Gauvain Pocentek wrote:
Hello all,
We are running several Ceph clusters and are facing an issue on one of
them, we would appreciate some input on the problems we're seeing.
We run Ceph in containers on Centos Stream 8, and we deploy using
ceph-ansible. While upgrading ceph from 16.2.7 to 16.2.10, we noticed that
OSDs were taking a very long time to restart on one of the clusters. (Other
clusters were not impacted at all.)
Are the other clusters of similar size?
The OSD startup was sometimes so slow
that we ended up having slow ops, with 1 or 2 PGs stuck in a peering state.
We've interrupted the upgrade and the cluster runs fine now, although we
have seen 1 OSD flapping recently, having trouble coming back to life.
We've checked a lot of things and read a lot of mails from this list, and
here is some info:
* this cluster has RBD pools for OpenStack and RGW pools; everything is
replicated x 3, except the RGW data pool which is EC 4+2
* we haven't found any hardware related issues; we run fully on SSDs and
they are all in good shape, no network issue, RAM and CPU are available on
all OSD hosts
* bluestore with an LVM collocated setup
* we have seen the slow restart with almost all the OSDs we've upgraded
(100 out of 350)
* on restart the ceph-osd process runs at 100% CPU but we haven't seen
anything weird on the host
Are the containers restricted to use a certain amount of CPU? Do the
OSDs, after ~ 10-20 seconds, increase their CPU usage to 200%? (If so,
this is probably because of the RocksDB option
max_background_compactions = 2.)
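A quick way to check both (a sketch; "podman" and the option name are
assumptions, adjust to your deployment):

    # CPU quota/usage of the OSD containers (docker instead of podman if applicable)
    podman stats --no-stream
    # RocksDB options Ceph hands to BlueStore, incl. max_background_compactions
    ceph config get osd bluestore_rocksdb_options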
* no DB spillover
* we have other clusters with the same hardware, and we don't see problems
there
The only thing we found that looks suspicious is the number of op log
entries for the PGs of the RGW index pool: `osd_max_pg_log_entries` is
set to 10k, but `ceph pg dump` shows PGs with more than 100k log entries
(the largest one has > 400k).
Could this be the reason for the slow startup of OSDs? If so, is there a
way to trim these logs without too much impact on the cluster?
Not sure. We have ~ 2K logs per PG.
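If you want to compare, this is roughly how the per-PG log counts can be
pulled out of ceph pg dump (a sketch; the column layout differs a bit
between releases, so check which column is LOG against the header first):

    # ten PGs with the largest pg_log (LOG was the 10th column here)
    ceph pg dump pgs 2>/dev/null | awk 'NR>1 {print $1, $10}' | sort -k2 -n | tail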
Let me know if additional info or logs are needed.
Do you have a log of slow ops and osd logs?
Do you have any non-standard configuration for the daemons? I.e. the
output of "ceph daemon osd.$id config diff".
We are running a Ceph Octopus (15.2.16) cluster with a similar
configuration. We have *a lot* of slow ops when starting OSDs, and also
during peering. When the OSDs start they consume 100% CPU for up to
~ 10 seconds, and after that 200% for a minute or more. During that time
the OSDs perform a compaction; you should be able to find this in the
OSD logs if it's the same in your case. After some time the OSDs are
done initializing and start the boot process. As soon as they boot up
and start peering, the slow ops kick in. Lots of "transitioning to
Primary" and "transitioning to Stray" logging. Some time later the OSD
becomes "active". While the OSD is busy with peering it's also busy
compacting, as I also see RocksDB compaction logging. So it might be
that RocksDB compactions are impacting OSD performance while it's
already busy becoming primary (and/or secondary / tertiary) for its PGs.
We had norecover, nobackfill and norebalance active when booting the
OSDs. So, it might just take a long time to do the RocksDB compaction.
In that case it might be better to do all the needed RocksDB compactions
first, and only then start booting. What might help is to set the "noup"
flag ("ceph osd set noup"): this prevents the OSDs from becoming active;
then wait for the RocksDB compactions to finish, and after that unset
the flag.
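Roughly, the sequence would look like this (a sketch, not tested against
your setup; osd.<id> is a placeholder, the systemd unit name comes from
our deployments and may differ, and "ceph daemon" needs the OSD's admin
socket, so run it on the OSD host / inside the container):

    ceph osd set noup                # restarted OSDs stay marked down, no peering yet
    systemctl restart ceph-osd@<id>  # restart / upgrade the OSD
    ceph daemon osd.<id> compact     # or wait for the automatic compactions to finish
    ceph osd unset noup              # now let the OSDs come up and start peering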
If you try this, please let me know how it goes.
Gr. Stefan