Re: slow operation observed for _collection_list

"Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx> · Fri, 5 Nov 2021 15:21:05 +0000

That seems to help, but after 1-2 days the problem comes back, on different OSDs and in some cases on the same OSD as well.
Is there a way to compact online, the same way the offline compaction does?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

From: Szabo, Istvan (Agoda)
Sent: Friday, October 29, 2021 8:43 PM
To: Igor Fedotov <igor.fedotov@xxxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxx>
Subject: Re: slow operation observed for _collection_list

I can try again, but I already ran compaction on all OSDs before migrating all the DBs back to the data devices.

On Oct 29, 2021, at 15:02, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

Please manually compact the DB using ceph-kvstore-tool for all the
affected OSDs (or preferably every OSD in the cluster). Most likely
you are facing RocksDB performance degradation caused by prior bulk data
removal. Setting bluefs_buffered_io to true (if not yet set) might be
helpful as well.
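
A minimal sketch of the steps above, written as a Python helper that only builds the command lines so they can be reviewed before anything is run. It assumes a package-based (non-containerized) deployment with systemd units named ceph-osd@ID and the default data path /var/lib/ceph/osd/ceph-ID; both are assumptions, and the function names are hypothetical. The OSD has to be stopped while ceph-kvstore-tool works on its store. The "ceph tell osd.ID compact" command is the online route; this thread does not establish whether it is as effective as the offline one.

```python
import shlex


def offline_compact_commands(osd_id, data_root="/var/lib/ceph/osd"):
    """Build the commands for an offline RocksDB compaction of one OSD.

    Assumes systemd unit ceph-osd@<id> and data path <data_root>/ceph-<id>.
    """
    osd_path = f"{data_root}/ceph-{osd_id}"
    return [
        # The OSD must not be running while its store is compacted offline.
        ["systemctl", "stop", f"ceph-osd@{osd_id}"],
        ["ceph-kvstore-tool", "bluestore-kv", osd_path, "compact"],
        ["systemctl", "start", f"ceph-osd@{osd_id}"],
    ]


def online_compact_command(osd_id):
    """Build the command that asks a running OSD to compact its DB."""
    return ["ceph", "tell", f"osd.{osd_id}", "compact"]


def buffered_io_command():
    """Build the command that enables bluefs_buffered_io for all OSDs."""
    return ["ceph", "config", "set", "osd", "bluefs_buffered_io", "true"]


if __name__ == "__main__":
    # Print the commands for osd.7 (the OSD from the log) instead of running them.
    for cmd in offline_compact_commands(7) + [buffered_io_command()]:
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere. On a real cluster the commands would be applied to one OSD at a time, waiting for the cluster to return to a healthy state before moving to the next.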

On 10/29/2021 3:22 PM, Szabo, Istvan (Agoda) wrote:

Hi,

We are seeing slow ops and laggy PGs because an OSD becomes inaccessible (on Octopus 15.2.14, and also on 15.2.10).
When the slow ops start, the OSD log shows:

"7f2a8d68f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2a70de5700' had timed out after 15"

This blocks I/O until the RADOS Gateway restarts itself.
Is this a bug or something else?

In ceph.log I can also see that this specific OSD is reported as failed by other OSDs:

2021-10-29T05:49:34.386857+0700 mon.server-3s01 (mon.0) 3576376 : cluster [DBG] osd.7 reported failed by osd.31
2021-10-29T05:49:34.454037+0700 mon.server-3s01 (mon.0) 3576377 : cluster [DBG] osd.7 reported failed by osd.22
2021-10-29T05:49:34.666758+0700 mon.server-3s01 (mon.0) 3576379 : cluster [DBG] osd.7 reported failed by osd.6
2021-10-29T05:49:34.807714+0700 mon.server-3s01 (mon.0) 3576382 : cluster [DBG] osd.7 reported failed by osd.11
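
To see which OSDs are affected, and how many peers report each one, lines like these can be summarized with a short script. This is only a sketch: the regular expression is derived from the four lines quoted above, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. "osd.7 reported failed by osd.31" anywhere in a ceph.log line.
FAILED_RE = re.compile(r"\bosd\.(\d+) reported failed by osd\.(\d+)")


def failure_reporters(lines):
    """Map each failed OSD id to the set of OSD ids that reported it."""
    reporters = defaultdict(set)
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            reporters[int(match.group(1))].add(int(match.group(2)))
    return dict(reporters)


SAMPLE = """\
2021-10-29T05:49:34.386857+0700 mon.server-3s01 (mon.0) 3576376 : cluster [DBG] osd.7 reported failed by osd.31
2021-10-29T05:49:34.454037+0700 mon.server-3s01 (mon.0) 3576377 : cluster [DBG] osd.7 reported failed by osd.22
2021-10-29T05:49:34.666758+0700 mon.server-3s01 (mon.0) 3576379 : cluster [DBG] osd.7 reported failed by osd.6
2021-10-29T05:49:34.807714+0700 mon.server-3s01 (mon.0) 3576382 : cluster [DBG] osd.7 reported failed by osd.11
"""

if __name__ == "__main__":
    for osd, peers in sorted(failure_reporters(SAMPLE.splitlines()).items()):
        print(f"osd.{osd}: reported failed by {len(peers)} peers {sorted(peers)}")
    # prints: osd.7: reported failed by 4 peers [6, 11, 22, 31]
```

Run against the full ceph.log, the same function shows whether the failures stay on one OSD or move between OSDs over time.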

Here is the osd log: https://justpaste.it/4x4h2
Here is the ceph.log itself: https://justpaste.it/5bk8k
Here is some additional information regarding memory usage and backtrace...: https://justpaste.it/1tmjg

Thank you
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx