Hi Tim,
first of all - given the provided logs - all the slow operations are
stuck in the 'waiting for sub ops' state.
This means that the reporting OSDs aren't suffering from local
issues but are stuck on replication operations to their peer OSDs.
In my experience even a single "faulty" OSD can cause such issues on
multiple other daemons, so the way to troubleshoot this is to find out
which OSD(s) are the actual culprit.
To do that, one might try the following approach:
1. When (or shortly after) the issue is happening, run 'ceph daemon
osd.N dump_historic_ops' (or even 'dump_ops_in_flight') against the
OSDs reporting slow operations.
2. From the above reports, pick operations with extraordinarily high
durations, e.g. > 5 seconds, and note the PG ids they've been run
against (see the example commands after this list), e.g. PG = 1.a in
the following sample:
"description": "osd_op(client.24184.0:23 >>>>1.a<<<<<
1:54253539:::benchmark_data_coalmon_70932_object22:head [set-alloc-hint
object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e19)",
3. For the affected PG(s), learn which OSDs are backing them, e.g. by
running 'ceph pg map <pgid>'.
4. If the PGs from the above step share a specific OSD that is common
to all (or the majority) of them, it is highly likely a good candidate
for further investigation, particularly inspection of the relevant OSD
logs.
Thanks,
Igor
On 9/30/2024 5:14 PM, Tim Sauerbein wrote:
Thanks for the replies everyone!
On 30 Sep 2024, at 13:10, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
Remember that slow ops are a top of the iceberg thing, you only see ones that crest above 30s
So far metrics of the hosted VMs show no other I/O slowdown except when these hiccups occur.
On 30 Sep 2024, at 13:35, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
There is no log attached to your post; you'd better share it via some other means.
BTW - what log did you mean - monitor or OSD one?
It would be nice to have logs for a couple of OSDs suffering from slow ops, preferably relevant to two different cases.
Sorry, the attachments have apparently been stripped. See here for one incident (they all look the same, but I can share more if relevant): monitor log, affected OSD logs, iostat log:
https://gist.github.com/sauerbein/5a485a6d2546475912709743e3cfbf4b
Let me know if you need any other logs to analyse!
On 30 Sep 2024, at 14:34, Alexander Schreiber <als@xxxxxxxxxxxxxxx> wrote:
One cause for "slow ops" I discovered is networking issues. I had slow
ops across my entire cluster (interconnected with 10G). Turns out the
switch was bad and achieved < 10 Mbit/s on one of the 10G links.
Replaced the switch, tested the links again - got full 10G connectivity
and the slow ops disappeared.
Thanks for the idea. The hosts are connected to two switches with fail-over bonding, normally communicating via the same switch. I will move them all over to the second switch to rule out a switch issue.
Best regards,
Tim
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx