Re: SLOW_OPS problems


 



Hi Tim,

thanks for the feedback, highly appreciated.

Out of curiosity - did you find out what the problem with that OSD was? Some hardware issue?


Regards,

Igor

On 10/14/2024 11:58 AM, Tim Sauerbein wrote:
Hi Igor,

Thanks for the valuable advice! I just wanted to provide feedback that it was indeed one single OSD causing the issues which I could triangulate as you said. After removing this OSD, the slow ops haven't occurred anymore.

Best regards,
Tim

On 1 Oct 2024, at 12:42, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

Hi Tim,

First of all, given the provided logs, all the slow operations are stuck in the 'waiting for sub ops' state.

This apparently means that the reporting OSDs aren't suffering from local issues but are stuck waiting on replication operations to their peer OSDs.

From my experience, even a single "faulty" OSD can cause such issues for multiple other daemons. The way to troubleshoot this is to find out which OSD(s) are the actual culprits.

To do that one might try to use the following approach:

1. While the issue is happening (or shortly after), run the 'ceph daemon osd.N dump_historic_ops' (or even 'dump_ops_in_flight') command against the OSDs reporting slow operations.

2. From the above reports, choose operations with an extraordinarily high duration, e.g. > 5 seconds, and note the PG ids they ran against, e.g. PG = 1.a in the following sample:

             "description": "osd_op(client.24184.0:23 >>>>1.a<<<<< 1:54253539:::benchmark_data_coalmon_70932_object22:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e19)",

3. For each affected PG, find out which OSDs are backing it, e.g. by running 'ceph pg map <pgid>'.

4. If the different PGs from the above step share a specific OSD that is common to all (or the majority) of them, it is highly likely a good candidate for additional investigation, in particular inspection of the relevant OSD logs.
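
Steps 2 and 4 could be scripted roughly as below. This is only a sketch under assumptions: it expects the JSON printed by 'dump_historic_ops' to carry an "ops" list whose entries have "description" and "duration" fields, and it expects the PG id to be the second token inside "osd_op(...)" as in the sample above (without the >>>> <<<<< markers, which are only highlighting). The helper names slow_pgs and rank_osds are made up for illustration, and the PG-to-OSD mapping is assumed to have been collected by hand from 'ceph pg map <pgid>'.

```python
import re
from collections import Counter

# Assumed description layout: "osd_op(<client> <pgid> <object> ...".
PGID_RE = re.compile(r"osd_op\(\S+\s+(\d+\.[0-9a-f]+)\s")


def slow_pgs(dump, min_duration=5.0):
    """Return the PG ids of ops that took longer than min_duration seconds.

    `dump` is the parsed JSON of one dump_historic_ops report
    (assumed shape: {"ops": [{"description": ..., "duration": ...}]}).
    """
    pgs = set()
    for op in dump.get("ops", []):
        if float(op.get("duration", 0)) <= min_duration:
            continue
        match = PGID_RE.search(op.get("description", ""))
        if match:
            pgs.add(match.group(1))
    return pgs


def rank_osds(pg_to_osds):
    """Count in how many affected PGs each OSD appears, most frequent first.

    `pg_to_osds` maps a PG id to the OSD ids backing it, e.g. as read
    from the output of 'ceph pg map <pgid>'.
    """
    counts = Counter()
    for osds in pg_to_osds.values():
        counts.update(set(osds))
    return counts.most_common()
```

An OSD that shows up at the top of rank_osds() for most of the slow PGs is the one whose logs are worth reading first; the count alone does not prove that OSD is faulty.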


Thanks,

Igor

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


