On 9/26/22 18:04, Gauvain Pocentek wrote:
We are running a Ceph Octopus (15.2.16) cluster with a similar configuration. We have *a lot* of slow ops when starting OSDs, and also during peering. When the OSDs start they consume 100% CPU for up to ~ 10 seconds, and after that ~ 200% for a minute or more. During that time the OSDs perform a compaction; you should be able to find this in the OSD logs if it's the same in your case.

After some time the OSDs are done initializing and start the boot process. As soon as they boot up and start peering, the slow ops kick in. Lots of "transitioning to Primary" and "transitioning to Stray" logging. Some time later the OSD becomes "active". While the OSD is busy with peering it's also busy compacting, since I also see RocksDB compaction logging. So it might be that RocksDB compactions impact OSD performance while the OSD is already busy becoming primary (and/or secondary/tertiary) for its PGs. We had norecover, nobackfill and norebalance set when booting the OSDs, so it might just be that the RocksDB compactions take a long time.

In that case it might be better to do all needed RocksDB compactions first, and only then start booting. So what might help is to set the "noup" flag ("ceph osd set noup"). This prevents the OSDs from becoming active; wait for the RocksDB compactions to finish, and after that unset the flag. If you try this, please let me know how it goes.
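A rough sketch of that workflow as a shell snippet (the systemd unit name assumes a package-based deployment, and the log pattern to watch for is an assumption; check your own OSD logs for the actual compaction messages):

  # Prevent the OSDs from being marked "up" while they compact:
  ceph osd set noup

  # (Re)start the OSDs, e.g. on a package-based deployment:
  systemctl restart ceph-osd.target

  # Watch the OSD logs until the RocksDB compaction logging settles
  # down (the grep pattern here is a guess, adjust to your logs):
  tail -f /var/log/ceph/ceph-osd.*.log | grep -i compact

  # Then let the OSDs boot:
  ceph osd unset noup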
Last night we had storage switch maintenance. We turned off 2/3 of the cluster and back on again (one failure domain at a time). We used the "noup" flag to prevent the OSDs from booting and waited for ~ 10 minutes; that was the time it took for the last OSD to finish its RocksDB compactions. At that point we unset the "noup" flag and almost all OSDs came back online instantly. This still resulted in some slow ops, but ~ 30 times fewer than before, and only for ~ 5 seconds.

With a bit more planning you can set the "noup" flag on individual OSDs and then, in a loop with some sleep, unset it per OSD; see the sketch below. This might give less stress during peering. This is, however, micromanagement. Ideally this "noup" step should not be needed at all. A, perhaps naive, solution would be to have the OSD refrain from becoming active while it's in the bootup phase and still busy going through a whole batch of RocksDB compactions. I'm CC-ing Igor to see if he can comment on this.
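For the per-OSD variant, something like the loop below might work. Note this assumes a release with per-OSD flag support via "ceph osd set-group" / "unset-group" (Nautilus and later); the OSD ids and the sleep time are placeholders to adapt:

  # Set the per-OSD "noup" flag on the OSDs about to be restarted:
  for i in 0 1 2; do
      ceph osd set-group noup osd.$i
  done

  # Restart the OSDs and wait for their RocksDB compactions to
  # finish, then let them come up one at a time:
  for i in 0 1 2; do
      ceph osd unset-group noup osd.$i
      sleep 30   # breathing room for peering; tune as needed
  done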
@Gauvain: Compared to your other clusters, does this cluster have more Ceph services running that the others don't? Your other clusters might have *way* less OMAP/metadata than the cluster that is giving you issues.
Gr. Stefan