Re: slow operation observed for _collection_list

Did anyone figure this out?
We are currently facing the same issue, but in our case the OSDs more often
kill themselves and have to be restarted by us.

This happens both to OSDs with an SSD-backed block.db and to OSDs that keep
the block.db on the BlueStore device.
All OSDs are rotating disks of various sizes.
We've disabled (deep) scrubbing to check whether that is what triggers the issue.
It happens on CentOS 7 and Ubuntu Focal hosts.
It only appeared after we upgraded from the latest Nautilus to the latest
Octopus. (During the upgrade we dropped the cluster_network option (it was
only a separate VLAN) and a lot of other config variables, so we are mostly
on default values.)

Even an OSD I added 6 hours ago (one of 20) hit this problem 20 minutes ago.
What we do: compact them offline and then start them again, roughly as
sketched below.
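
Concretely, the per-OSD cycle looks roughly like this (a sketch, assuming a
package-based deployment where OSDs run as ceph-osd@<id> systemd units;
osd.7 and its data path are placeholders):

  # stop the OSD so its RocksDB is no longer open
  systemctl stop ceph-osd@7
  # compact the BlueStore RocksDB offline
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-7 compact
  # bring the OSD back into the cluster
  systemctl start ceph-osd@7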

On Fri, 5 Nov 2021 at 16:22, Szabo, Istvan (Agoda) <
Istvan.Szabo@xxxxxxxxx> wrote:

> Seems like it can help, but after 1-2 days the problem comes back on
> different OSDs, and in some cases on the same OSD as well.
> Is there any way to compact online the same way it compacts offline?
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
> ---------------------------------------------------
>
> From: Szabo, Istvan (Agoda)
> Sent: Friday, October 29, 2021 8:43 PM
> To: Igor Fedotov <igor.fedotov@xxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re:  slow operation observed for _collection_list
>
> I can give it a try again, but before I migrated all the DBs back to the
> data devices I did run a compaction on all OSDs.
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
> ---------------------------------------------------
>
>
> On 2021. Oct 29., at 15:02, Igor Fedotov <igor.fedotov@xxxxxxxx<mailto:
> igor.fedotov@xxxxxxxx>> wrote:
>
> Please manually compact the DB using ceph-kvstore-tool for all the
> affected OSDs (or, preferably, for every OSD in the cluster). Highly likely
> you're facing RocksDB performance degradation caused by prior bulk data
> removal. Setting bluefs_buffered_io to true (if not yet set) might help as
> well.
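>
> For illustration, the offline compaction and the bluefs_buffered_io change
> would look roughly like this (a sketch; osd.7 and its data path are
> placeholders, and the OSD has to be stopped before ceph-kvstore-tool runs):
>
>   # offline compaction of the OSD's RocksDB (BlueStore backend)
>   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-7 compact
>   # enable buffered reads for BlueFS, cluster-wide for all OSDs
>   ceph config set osd bluefs_buffered_io true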
>
>
> On 10/29/2021 3:22 PM, Szabo, Istvan (Agoda) wrote:
>
> Hi,
>
> We are seeing slow ops and laggy PGs because an OSD becomes inaccessible
> (on Octopus 15.2.14, and on 15.2.10 as well).
> At the time the slow ops started, the OSD log shows:
>
> "7f2a8d68f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread
> 0x7f2a70de5700' had timed out after 15"
>
> This blocks IO until the RADOS Gateway restarts itself.
> Is this a bug or something else?
>
> In ceph.log I can also see that the specific OSD is reported failed by
> other OSDs:
>
> 2021-10-29T05:49:34.386857+0700 mon.server-3s01 (mon.0) 3576376 : cluster
> [DBG] osd.7 reported failed by osd.31
> 2021-10-29T05:49:34.454037+0700 mon.server-3s01 (mon.0) 3576377 : cluster
> [DBG] osd.7 reported failed by osd.22
> 2021-10-29T05:49:34.666758+0700 mon.server-3s01 (mon.0) 3576379 : cluster
> [DBG] osd.7 reported failed by osd.6
> 2021-10-29T05:49:34.807714+0700 mon.server-3s01 (mon.0) 3576382 : cluster
> [DBG] osd.7 reported failed by osd.11
>
> Here is the osd log: https://justpaste.it/4x4h2
> Here is the ceph.log itself: https://justpaste.it/5bk8k
> Here is some additional information regarding memory usage and
> backtrace...: https://justpaste.it/1tmjg
>
> Thank you
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>


-- 
The self-help group "UTF-8 problems" will, as an exception, meet in the
large hall this time.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



