Re: Slow OSD startup and slow ops

Hello,

On Fri, Sep 30, 2022 at 8:12 AM Gauvain Pocentek <gauvainpocentek@xxxxxxxxx>
wrote:

> Hi Stefan,
>
> Thanks for your feedback!
>
>
> On Thu, Sep 29, 2022 at 10:28 AM Stefan Kooman <stefan@xxxxxx> wrote:
>
>> On 9/26/22 18:04, Gauvain Pocentek wrote:
>>
>> >
>> >
>> >     We are running a Ceph Octopus (15.2.16) cluster with a similar
>> >     configuration. We have *a lot* of slow ops when starting OSDs,
>> >     and also during peering. When the OSDs start they consume 100%
>> >     CPU for up to ~ 10 seconds, and after that 200% for a minute or
>> >     more. During that time the OSDs perform a compaction; you should
>> >     be able to find this in the OSD logs if it's the same in your
>> >     case. After some time the OSDs are done initializing and start
>> >     the boot process. As soon as they boot up and start peering, the
>> >     slow ops kick in. Lots of "transitioning to Primary" and
>> >     "transitioning to Stray" logging. Some time later the OSD becomes
>> >     "active". While the OSD is busy with peering it's also busy
>> >     compacting, as I also see RocksDB compaction logging. So it might
>> >     be due to RocksDB compactions impacting OSD performance while
>> >     it's already busy becoming primary (and/or secondary / tertiary)
>> >     for its PGs.
>> >
>> >     We had norecover, nobackfill and norebalance active when booting
>> >     the OSDs.
>> >
>> >     So, it might just take a long time to do RocksDB compaction. In
>> >     that case it might be better to do all needed RocksDB compactions
>> >     first, and then start booting. So, what might help is to set
>> >     "ceph osd set noup". This prevents the OSDs from becoming active;
>> >     then wait for the RocksDB compactions, and after that unset the
>> >     flag.
>> >
>> >     If you try this, please let me know how it goes.
>>
>> Last night we had storage switch maintenance. We turned off 2/3 of the
>> cluster and back on (one failure domain at a time). We used the "noup"
>> flag to prevent the OSDs from booting and waited for ~ 10 minutes. That
>> was the time it took for the last OSD to finish its RocksDB compactions.
>> At that point we unset the "noup" flag and almost all OSDs came back
>> online instantly. This resulted in some slow ops, but ~ 30 times fewer
>> than before, and only for ~ 5 seconds. With a bit more planning you can
>> set the "noup" flag on individual OSDs and then, in a loop with some
>> sleep, unset it per OSD. This might give less stress during peering.
>> This is however micro-management. Ideally this "noup" step should not
>> be needed at all. The (maybe naive) solution would be to have the OSD
>> refrain from becoming active while it's in the bootup phase and busy
>> going through a whole batch of RocksDB compaction events. I'm CC-ing
>> Igor to see if he can comment on this.
>>
>> @Gauvain: Compared to your other clusters, does this cluster run more
>> Ceph services than the others? Your other clusters might have *way*
>> less OMAP/metadata than the cluster giving you issues.
>>
>
> This cluster runs the same services as other clusters.
>
> It looks like we are hitting this bug:
> https://tracker.ceph.com/issues/53729. There seem to be a lot of
> duplicated op log entries (I'm still trying to understand what that
> really is), huge memory usage (which hasn't been a problem because of
> the size of our servers, we have a lot of RAM), and so far no way to
> clean that up online with Pacific. This blog post explains very clearly
> how to check if you are impacted:
> https://www.clyso.com/blog/osds-with-unlimited-ram-growth/
>
> All the clusters seem to be impacted, but that specific one shows worse
> signs.
>
> We are now looking into the offline cleanup. We're taking a lot of
> precautions because this is a production cluster and the problems have
> already impacted users.
>
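
(Side note for the archives: the "noup" approach Stefan describes boils
down to something like the sketch below. Treat the systemd and log-watching
bits as placeholders, they depend on how the cluster is deployed; only the
flag handling itself is the actual mechanism.)

# on a mon or admin node
ceph osd set noup

# (re)start the OSDs; with "noup" set they compact but are not marked up
systemctl start ceph-osd.target

# wait until the OSD logs stop showing RocksDB compaction activity,
# for example by watching them (path assumes a non-containerized install)
tail -f /var/log/ceph/ceph-osd.*.log | grep -i compact

# once the compactions are done, let the OSDs be marked up and peer
ceph osd unset noup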

After more analysis and testing we are definitely hitting the pg_log dups
bug. It is causing the very slow startup of OSDs and some instability in
the cluster.
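
For anyone who wants to check whether they are affected: the test from the
Clyso post boils down to dumping a PG's log with ceph-objectstore-tool and
counting the "dups" entries. Roughly, the check looks like this (the PG id
is just an example, and the jq paths are the ones from the blog post, so
double-check them against your tool's output; the stock Pacific binary is
enough for this read-only check):

systemctl stop ceph-osd@$OSD_ID
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSD_ID/ \
    --op log --pgid 2.0 > /tmp/pglog.json
# a healthy PG stays around the osd_pg_log_dups_tracked default (~3000);
# affected PGs reportedly show hundreds of thousands or millions of dups
jq '(.pg_log_t.log | length), (.pg_log_t.dups | length)' /tmp/pglog.json
systemctl start ceph-osd@$OSD_ID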

Since the fix is not yet released for Pacific, we have started to clean up
the OSDs manually.

We have compiled ceph-objectstore-tool from the pacific branch of the git
repo to get the `--op trim-pg-log-dups` feature and we're now running this
on all the OSDs:

(mon) for i in norebalance norecover nobackfill; do ceph osd set $i; done

systemctl stop ceph-osd@$OSD_ID

# we're only dealing with PGs of pool 2; it turns out this is where we
# have the biggest problems
/opt/ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSD_ID/ \
    --op list-pgs | grep '^2\.' > /tmp/pgs.txt

while read pg; do
    /opt/ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSD_ID/ \
        --op trim-pg-log-dups --pgid $pg \
        --osd_max_pg_log_entries=100 \
        --osd_pg_log_dups_tracked=100 \
        --osd_pg_log_trim_max=500000
done < /tmp/pgs.txt

systemctl start ceph-osd@$OSD_ID

(mon) for i in norebalance norecover nobackfill; do ceph osd unset $i; done
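
To cover all the OSDs on a host you can wrap the per-OSD steps in a loop,
one OSD at a time. Something along these lines (a sketch only: the glob
assumes the default /var/lib/ceph/osd layout of a package-based install,
and the sleep is an arbitrary pause to let each OSD boot and peer before
the next one is touched):

for path in /var/lib/ceph/osd/ceph-*; do
    OSD_ID=${path##*-}
    systemctl stop ceph-osd@$OSD_ID
    /opt/ceph-objectstore-tool --data-path $path/ \
        --op list-pgs | grep '^2\.' > /tmp/pgs.txt
    while read pg; do
        /opt/ceph-objectstore-tool --data-path $path/ \
            --op trim-pg-log-dups --pgid $pg \
            --osd_max_pg_log_entries=100 \
            --osd_pg_log_dups_tracked=100 \
            --osd_pg_log_trim_max=500000
    done < /tmp/pgs.txt
    systemctl start ceph-osd@$OSD_ID
    sleep 60
done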


`--osd_pg_log_trim_max=500000` is important for speed: the default is 10000,
which makes the process very slow. With 500000 the RAM usage of the
ceph-objectstore-tool process went up to about 5 GB.

I hope this can help other people having the same issue.

Gauvain

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


