On 9/26/22 18:04, Gauvain Pocentek wrote:
We are running a Ceph Octopus (15.2.16) cluster with a similar configuration. We have *a lot* of slow ops when starting OSDs, and also during peering. When the OSDs start they consume 100% CPU for up to ~ 10 seconds, and after that ~ 200% for a minute or more. During that time the OSDs perform a compaction; you should be able to find this in the OSD logs if it's the same in your case.

After some time the OSDs are done initializing and start the boot process. As soon as they boot up and start peering, the slow ops kick in. Lots of "transitioning to Primary" and "transitioning to Stray" logging. Some time later the OSD becomes "active". While the OSD is busy with peering it's also busy compacting, since I also see RocksDB compaction logging. So it might be that RocksDB compactions impact OSD performance while the OSD is already busy becoming primary (and/or secondary/tertiary) for its PGs. We had norecover, nobackfill and norebalance set when booting the OSDs, so it might just be that the RocksDB compactions take a long time.

In that case it might be better to do all needed RocksDB compactions first, and only then start booting. So what might help is to set the "noup" flag ("ceph osd set noup"). This prevents the OSDs from becoming active; wait for the RocksDB compactions to finish, and after that unset the flag. If you try this, please let me know how it goes.
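A rough sketch of that workflow as a shell snippet (the systemd unit name assumes a package-based deployment, and the log pattern to watch for is an assumption; check your own OSD logs for the actual compaction messages):

  # Prevent the OSDs from being marked "up" while they compact:
  ceph osd set noup

  # (Re)start the OSDs, e.g. on a package-based deployment:
  systemctl restart ceph-osd.target

  # Watch the OSD logs until the RocksDB compaction logging settles
  # down (the grep pattern here is a guess, adjust to your logs):
  tail -f /var/log/ceph/ceph-osd.*.log | grep -i compact

  # Then let the OSDs boot:
  ceph osd unset noup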
Last night we had storage switch maintenance. We turned off 2/3 of the cluster and back on again (one failure domain at a time). We used the "noup" flag to prevent the OSDs from booting and waited for ~ 10 minutes; that was the time it took for the last OSD to finish its RocksDB compactions. At that point we unset the "noup" flag and almost all OSDs came back online instantly. This still resulted in some slow ops, but ~ 30 times fewer than before, and only for ~ 5 seconds.

With a bit more planning you can set the "noup" flag on individual OSDs and then, in a loop with some sleep, unset it per OSD; see the sketch below. This might give less stress during peering. This is, however, micromanagement. Ideally this "noup" step should not be needed at all. A, perhaps naive, solution would be to have the OSD refrain from becoming active while it's in the bootup phase and still busy going through a whole batch of RocksDB compactions. I'm CC-ing Igor to see if he can comment on this.
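For the per-OSD variant, something like the loop below might work. Note this assumes a release with per-OSD flag support via "ceph osd set-group" / "unset-group" (Nautilus and later); the OSD ids and the sleep time are placeholders to adapt:

  # Set the per-OSD "noup" flag on the OSDs about to be restarted:
  for i in 0 1 2; do
      ceph osd set-group noup osd.$i
  done

  # Restart the OSDs and wait for their RocksDB compactions to
  # finish, then let them come up one at a time:
  for i in 0 1 2; do
      ceph osd unset-group noup osd.$i
      sleep 30   # breathing room for peering; tune as needed
  done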
@Gauvain: Compared to your other clusters, does this cluster have more Ceph services running that the others don't? Your other clusters might have *way* less OMAP/metadata than the cluster that is giving you issues.
Gr. Stefan