Re: SLOW_OPS problems

Hi Tim,

First of all - given the provided logs - all the slow operations are stuck in the 'waiting for sub ops' state.

This apparently means that the reporting OSDs aren't suffering from local issues but are stuck on replication operations to their peer OSDs.

From my experience, even a single "faulty" OSD can cause such issues on multiple other daemons. The way to troubleshoot this is to find out which OSD(s) are the actual culprits.

To do that, one might try the following approach:

1. When (or shortly after) the issue is happening, run the 'ceph daemon osd.N dump_historic_ops' (or even 'dump_ops_in_flight') command against the OSDs reporting slow operations.

2. From the above reports, pick operations with extraordinarily high durations, e.g. > 5 seconds, and note the PG IDs they were run against, e.g. PG 1.a in the following sample:

            "description": "osd_op(client.24184.0:23 >>>>1.a<<<<< 1:54253539:::benchmark_data_coalmon_70932_object22:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e19)",

3. For the affected PG(s), find out which OSDs back them, e.g. by running 'ceph pg map <pgid>'.

4. If the PGs from the above step share a specific OSD that is common to all (or the majority) of them, that OSD is highly likely a good candidate for additional investigation, particularly inspection of its logs. A rough example session is sketched below.
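
For illustration, here is a minimal sketch of such a session. It assumes osd.3 is one of the OSDs reporting slow ops, that jq is available, and that the dump is run on the node hosting that OSD (admin socket commands are local); the exact JSON field names may vary between Ceph releases:

    # Steps 1+2: dump recent ops on the reporting OSD and keep only the
    # descriptions of operations that took longer than 5 seconds.
    ceph daemon osd.3 dump_historic_ops > osd3_ops.json
    jq '.ops[] | select(.duration > 5) | .description' osd3_ops.json

    # Step 3: map a PG id taken from one of the descriptions
    # (1.a in the sample above).
    ceph pg map 1.a
    # -> osdmap eNNN pg 1.a (1.a) -> up [...] acting [...]

    # Step 4: repeat for every slow PG; an OSD that appears in most of the
    # acting sets is the prime suspect, and its log is worth inspecting.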


Thanks,

Igor

On 9/30/2024 5:14 PM, Tim Sauerbein wrote:
Thanks for the replies everyone!

On 30 Sep 2024, at 13:10, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

Remember that slow ops are a tip-of-the-iceberg thing; you only see the ones that crest above 30s.
So far metrics of the hosted VMs show no other I/O slowdown except when these hiccups occur.

On 30 Sep 2024, at 13:35, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

There is no log attached to your post; you'd better share it via some other means.

BTW, which log did you mean - the monitor or the OSD one?

It would be nice to have logs for a couple of OSDs suffering from slow ops, preferably relevant to two different cases.

Sorry, the attachments have apparently been stripped. See here for one incident (they all look the same, but I can share more if relevant): monitor log, affected OSD logs, iostat log:

https://gist.github.com/sauerbein/5a485a6d2546475912709743e3cfbf4b

Let me know if you need any other logs to analyse!

On 30 Sep 2024, at 14:34, Alexander Schreiber <als@xxxxxxxxxxxxxxx> wrote:

One cause for "slow ops" I discovered is networking issues. I had slow
ops across my entire cluster (interconnected with 10G). It turned out the
switch was bad and achieved < 10 MBit/s on one of the 10G links.
I replaced the switch, tested the links again - got full 10G connectivity -
and the slow ops disappeared.
Thanks for the idea. The hosts are connected to two switches with fail-over bonding, normally communicating via the same switch. I will move them all over to the second switch to rule out a switch issue.

Best regards,
Tim

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



