Hello Stefan,

Thank you for your answers.

On Thu, Sep 22, 2022 at 5:54 PM Stefan Kooman <stefan@xxxxxx> wrote:

> Hi,
>
> On 9/21/22 18:00, Gauvain Pocentek wrote:
> > Hello all,
> >
> > We are running several Ceph clusters and are facing an issue on one of them, we would appreciate some input on the problems we're seeing.
> >
> > We run Ceph in containers on CentOS Stream 8, and we deploy using ceph-ansible. While upgrading Ceph from 16.2.7 to 16.2.10, we noticed that OSDs were taking a very long time to restart on one of the clusters. (Other clusters were not impacted at all.)
>
> Are the other clusters of similar size?

We have at least one cluster that is roughly the same size. It has not been upgraded yet, but restarting its OSDs doesn't create any issues.

> > The OSD startup was sometimes so slow that we ended up having slow ops, with 1 or 2 PGs stuck in a peering state. We've interrupted the upgrade and the cluster runs fine now, although we have seen 1 OSD flapping recently, having trouble coming back to life.
> >
> > We've checked a lot of things and read a lot of mails from this list; here is some info:
> >
> > * this cluster has RBD pools for OpenStack and RGW pools; everything is replicated x 3, except the RGW data pool which is EC 4+2
> > * we haven't found any hardware-related issues; we run fully on SSDs and they are all in good shape, no network issues, and RAM and CPU are available on all OSD hosts
> > * bluestore with an LVM collocated setup
> > * we have seen the slow restart with almost all the OSDs we've upgraded (100 out of 350)
> > * on restart the ceph-osd process runs at 100% CPU, but we haven't seen anything weird on the host
>
> Are the containers restricted to use a certain amount of CPU? Do the OSDs, after ~ 10-20 seconds, increase their CPU usage to 200%? (If so, this is probably because of the rocksdb option max_background_compactions = 2.)

This is actually a good point. We run the containers with --cpus=2. We also had a couple of incidents where OSDs started to act up on nodes where VMs were running CPU-intensive workloads (we have a hyperconverged setup with OpenStack). So there's definitely something going on there. I haven't had the opportunity to do a new restart to check more about the CPU usage, but I hope to do that this week.

> > * no DB spillover
> > * we have other clusters with the same hardware, and we don't see problems there
> >
> > The only thing we found that looks suspicious is the number of op logs for the PGs of the RGW index pool. `osd_max_pg_log_entries` is set to 10k, but `ceph pg dump` shows PGs with more than 100k logs (the largest one has over 400k logs).
> >
> > Could this be the reason for the slow startup of OSDs? If so, is there a way to trim these logs without too much impact on the cluster?
>
> Not sure. We have ~ 2K logs per PG.
>
> > Let me know if additional info or logs are needed.
>
> Do you have a log of slow ops and osd logs?

I will get more logs when I restart an OSD this week. What log levels for bluestore/rocksdb would you recommend?

> Do you have any non-standard configuration for the daemons? I.e. ceph daemon osd.$id config diff

Nothing non-standard.

> We are running a Ceph Octopus (15.2.16) cluster with similar configuration. We have *a lot* of slow ops when starting OSDs, and also during peering. When the OSDs start they consume 100% CPU for up to ~ 10 seconds, and after that consume 200% for a minute or more. During that time the OSDs perform a compaction. You should be able to find this in the OSD logs if it's the same in your case. After some time the OSDs are done initializing and start the boot process. As soon as they boot up and start peering, the slow ops kick in. Lots of "transitioning to Primary" and "transitioning to Stray" logging. Some time later the OSD becomes "active". While the OSD is busy with peering it's also busy compacting, as I also see RocksDB compaction logging. So it might be due to RocksDB compactions impacting OSD performance while it's already busy becoming primary (and/or secondary/tertiary) for its PGs.
>
> We had norecover, nobackfill, norebalance active when booting the OSDs.
>
> So, it might just take a long time to do RocksDB compaction. In this case it might be better to do all needed RocksDB compactions first, and then start booting. So, what might help is to set "ceph osd set noup". This prevents the OSD from becoming active; wait for the RocksDB compactions, and after that unset the flag.
>
> If you try this, please let me know how it goes.

That sounds like a good thing to try, I'll keep you posted.
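Roughly what I have in mind for one OSD at a time, based on your suggestion (just a sketch for now, nothing validated; osd.42 is a placeholder and the systemd unit name is a guess for whatever ceph-ansible set up for the containers on our side):

    # keep the restarted OSD from being marked up while it compacts
    ceph osd set noup
    # restart the OSD container however it is normally restarted
    systemctl restart ceph-osd@42
    # watch the OSD log until the RocksDB compaction messages stop, e.g.
    journalctl -u ceph-osd@42 -f | grep -i compact
    # then let the OSD join the cluster again
    ceph osd unset noup

If I understand your explanation correctly, the point is that the compaction happens while the OSD is still down from the cluster's point of view, so peering (and the slow ops that come with it) only starts once RocksDB has settled. Since noup is a cluster-wide flag we would only keep it set for the duration of each restart.

In the meantime, this is roughly how I've been pulling the per-PG log counts for the index pool out of `ceph pg dump`, in case you want to compare with your ~2K (also a sketch; 11 is a placeholder for our RGW index pool id, and the json field names are from memory, so the jq path may need adjusting):

    # list the PGs of pool 11 sorted by pg_log length (log_size), largest first
    ceph pg dump --format json 2>/dev/null \
      | jq -r '.pg_map.pg_stats[] | select(.pgid | startswith("11.")) | "\(.pgid) \(.log_size)"' \
      | sort -k2 -rn | head -20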
Thanks again,
Gauvain
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx