Hi Stefan,

Thanks for your feedback!

On Thu, Sep 29, 2022 at 10:28 AM Stefan Kooman <stefan@xxxxxx> wrote:

> On 9/26/22 18:04, Gauvain Pocentek wrote:
>
> > We are running a Ceph Octopus (15.2.16) cluster with a similar
> > configuration. We have *a lot* of slow ops when starting OSDs, and
> > also during peering. When the OSDs start they consume 100% CPU for
> > up to ~10 seconds, and after that consume 200% for a minute or
> > more. During that time the OSDs perform a compaction; you should be
> > able to find this in the OSD logs if it's the same in your case.
> > After some time the OSDs are done initializing and start the boot
> > process. As soon as they boot up and start peering, the slow ops
> > kick in. Lots of "transitioning to Primary" and "transitioning to
> > Stray" logging. Some time later the OSD becomes "active". While the
> > OSD is busy with peering it is also busy compacting, as I also see
> > RocksDB compaction logging. So it might be due to RocksDB
> > compactions impacting OSD performance while it's already busy
> > becoming primary (and/or secondary / tertiary) for its PGs.
> >
> > We had norecover, nobackfill and norebalance active when booting
> > the OSDs.
> >
> > So, it might just take a long time to do RocksDB compaction. In
> > this case it might be better to do all the needed RocksDB
> > compactions first, and then start booting. So, what might help is
> > to set "ceph osd set noup". This prevents the OSDs from becoming
> > active; then wait for the RocksDB compactions, and after that unset
> > the flag.
> >
> > If you try this, please let me know how it goes.
>
> Last night we had storage switch maintenance. We turned off 2/3 of
> the cluster and back on (one failure domain at a time). We used the
> "noup" flag to prevent the OSDs from booting and waited for ~10
> minutes; that was the time it took for the last OSD to finish its
> RocksDB compactions. At that point we unset the "noup" flag and
> almost all OSDs came back online instantly. This resulted in some
> slow ops, but ~30 times fewer than before, and only for ~5 seconds.
> With a bit more planning you can set the "noup" flag on individual
> OSDs and then, in a loop with some sleep, unset it per OSD. This
> might give less stress during peering. It is however
> micro-management; ideally this "noup" step should not be needed at
> all. The (maybe naive) solution would be to have the OSD refrain
> from becoming active while it's in the boot-up phase and busy going
> through a whole batch of RocksDB compaction events. I'm CC-ing Igor
> to see if he can comment on this.
>
> @Gauvain: Compared to your other clusters, does this cluster have
> Ceph services running that the others don't? Your other clusters
> might have *way* less OMAP/metadata than the cluster giving you
> issues.

This cluster runs the same services as the other clusters.

It looks like we are hitting this bug:
https://tracker.ceph.com/issues/53729. There seem to be a lot of
duplicated op log entries (I'm still trying to understand what these
really are), huge memory usage (which hasn't been a problem for us
because our servers have a lot of RAM), and so far no way to clean
this up online with Pacific. This blog post explains very clearly how
to check if you are impacted:
https://www.clyso.com/blog/osds-with-unlimited-ram-growth/

All our clusters seem to be impacted, but that specific one shows
worse signs. We are now looking into the offline cleanup.
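For anyone else following the thread: as far as I understand it, the
check from the blog post boils down to dumping the PG log of a stopped
OSD with ceph-objectstore-tool and counting the dup entries. Roughly
something like the following; the OSD data path and PG id are just
examples, and the jq field path is from memory, so it may differ
slightly between releases:

  # with the OSD stopped
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 \
      --pgid 2.7ff --op log > /tmp/pg-2.7ff-log.json
  # count the dup entries in the dumped PG log
  jq '.pg_log_t.dups | length' /tmp/pg-2.7ff-log.json

If I remember correctly a healthy PG should stay in the low thousands
(osd_pg_log_dups_tracked defaults to 3000), while PGs hit by the bug
reportedly accumulate millions of dup entries. A quick online hint is
a very large osd_pglog mempool in "ceph tell osd.N dump_mempools".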
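And thanks again for the noup tip. For the per-OSD variant you
describe, I imagine a small loop along these lines would do (untested
sketch, the OSD ids and the sleep are placeholders; I believe
set-group/unset-group is the right tool for per-OSD flags):

  # keep these OSDs from being marked up while they compact
  for id in 12 13 14; do
      ceph osd set-group noup osd.$id
  done

  # restart the OSDs, wait for the RocksDB compactions to finish,
  # then let them join the cluster one at a time
  for id in 12 13 14; do
      ceph osd unset-group noup osd.$id
      sleep 60
  done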
We're taking a lot of precautions because this is a production cluster
and the problems have already impacted users.

Gauvain