Re: Slow OSD startup and slow ops

On 9/26/22 18:04, Gauvain Pocentek wrote:
    We are running a Ceph Octopus (15.2.16) cluster with a similar
    configuration. We see *a lot* of slow ops when starting OSDs, and
    also during peering. When the OSDs start they consume 100% CPU for
    up to ~ 10 seconds, and after that 200% for a minute or more.
    During that time the OSDs perform a compaction; you should be able
    to find this in the OSD logs if it's the same in your case. After
    some time the OSDs are done initializing and start the boot
    process. As soon as they boot up and start peering, the slow ops
    kick in: lots of "transitioning to Primary" and "transitioning to
    Stray" logging, and some time later the OSD becomes "active".
    While the OSD is busy with peering it is also busy compacting (I
    see RocksDB compaction logging as well). So the slow ops might be
    due to RocksDB compactions impacting OSD performance while the OSD
    is already busy becoming primary (and/or secondary / tertiary) for
    its PGs.

    We had the norecover, nobackfill and norebalance flags set when
    booting the OSDs.

    So, it might just take a long time to do the RocksDB compactions.
    In that case it might be better to let all needed RocksDB
    compactions finish first, and only then let the OSDs boot. What
    might help is to run "ceph osd set noup": this prevents the OSDs
    from becoming active; then wait for the RocksDB compactions to
    finish, and after that unset the flag.

    If you try this, please let me know how it goes.
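For reference, the procedure quoted above boils down to something like the following. The exact RocksDB compaction messages vary between versions, so treat the grep pattern as an assumption and check your own OSD logs:

    # prevent all (re)starting OSDs from being marked up
    ceph osd set noup

    # restart the OSDs, then watch the logs until the RocksDB
    # compaction activity has stopped (the pattern is an assumption,
    # adjust it to what your OSD logs actually print)
    grep -i compaction /var/log/ceph/ceph-osd.*.log

    # once the compactions are done, let the OSDs come up
    ceph osd unset noup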

Last night we had storage switch maintenance. We turned off 2/3 of the cluster and back on (one failure domain at a time). We used the "noup" flag to prevent the OSDs from booting and waited for ~ 10 minutes; that was the time it took for the last OSD to finish its RocksDB compactions. At that point we unset the "noup" flag and almost all OSDs came back online instantly. This still resulted in some slow ops, but ~ 30 times fewer than before, and only for ~ 5 seconds.

With a bit more planning you can set the "noup" flag on individual OSDs and then, in a loop with some sleep, unset it per OSD. This might give less stress during peering. This is however micromanagement.

Ideally this "noup" step should not be needed at all. A perhaps naive solution would be to have the OSD refrain from becoming active while it's in the bootup phase and busy going through a whole batch of RocksDB compaction events. I'm CC-ing Igor to see if he can comment on this.
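A per-OSD version could look roughly like the sketch below. The OSD ids are made up, and I'm assuming the per-flag group commands that Ceph got in Nautilus ("ceph osd set-group" / "ceph osd unset-group"); on older releases "ceph osd add-noup" / "ceph osd rm-noup" should be the equivalent:

    # keep a batch of OSDs down while they compact (ids are hypothetical)
    for id in 12 13 14; do
        ceph osd set-group noup osd.$id
    done

    # ... restart the OSDs and wait for the compactions to finish ...

    # bring them up one at a time to spread out the peering load
    for id in 12 13 14; do
        ceph osd unset-group noup osd.$id
        sleep 30
    done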

@Gauvain: Compared to your other clusters, does this cluster have more Ceph services running than the others? Your other clusters might have *way* less OMAP/metadata than the cluster that is giving you issues.
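If you want to compare: "ceph osd df" should show the per-OSD OMAP and metadata usage on Octopus (the OMAP and META columns), so a quick check on both clusters could be:

    # compare per-OSD OMAP / metadata usage; the OMAP and META
    # columns should stand out on the problem cluster
    ceph osd df tree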

Gr. Stefan